I talked once to an AMD project-managery guy. He told me how they have several teams in different countries, and for each feature/bug different managers have to "bid" with weekly resource estimates. Bid too high and you don't get enough work for your people. Bid too low and they end up with unpaid overtime. Rinse and repeat.
And these are kernel developers...
EDIT: to be clear, I am rooting for competition to Intel and NVIDIA. I just don't think this kind of culture can work on the software side.
They'd be better off hiring an economist as their boss.
If this is getting closer, I think they need a better tape measure.
With 16GB of Aquabolt HBM2 on a 7nm node, there is no way AMD is losing less than $150 per card, without even accounting for opportunity cost.
In fact, this card is the reason the head of RTG was fired, after suggesting selling it at $749 at a loss.
The 2080 for all intents and purposes is a mid range die.
The only thing AMD is competing on atm is price, and for them it's a lose-lose situation until Navi comes out, and that's only if Navi can actually be competitive above the $300 price bracket. As it's a Polaris successor, it's not clear if AMD will have anything on the level of the 2080, not to mention the 2080 Ti, based on Navi.
AMD had to drop the prices of Vega to around $300 due to the RTX 2060, so they are likely also losing money on that front; they are losing money on the Radeon 7, and hopefully they will finally make money with Navi.
The Radeon VII is ultimately just a repurposed workstation card given flagship treatment. It's not where AMD's business really is, ever since they went down the route of semi-custom and small dies. While pushing RTX could renew Nvidia's advantages, the only game console they're on these days is the Switch - no raytracing to be seen there. Developers accommodate the Geforce cards for PC releases, but AAA going console-first precludes designing most content around RTX. They've made a lot of moves to try to repurpose high end graphics for other markets, but it looks like it's going to be very rough for Nvidia in the next few years if they can't find a "blue ocean" that needs both speed and programmability.
On the flip side, weak console performance is making the PC become the default platform for AAA games.
>but it looks like it's going to be very rough for Nvidia in the next few years if they can't find a "blue ocean" that needs both speed and programmability.
I don't see how. According to the steam hardware survey the top 10 GPUs used were all nvidia.
It seems that AMD will have competition on the APU side very soon, and I really hope Intel pulls it off and is actually competitive in the discrete market as well in 2020 or 2021, and that their GPU adventure isn't going to get a backroom abortion.
As it stands, $500-700 gets you 1080p gaming that's better looking than current-gen consoles; this is the entry level and it's already several notches above APU.
Few people build their own PC. Many more play games.
I have heard that there are some prebuilts that are worth it. But you won't be able to know which ones are good until you've had experience building your own.
Also what's up with kids these days not setting textures to potato mode to eke out the last FPS, like we did with Quake? :P
It would help if you could actually buy them. The external graphics cards were great price-to-performance, but were basically impossible to find for a while (since they were the best at crypto/mining). So most of the people who wanted one, already had to settle for something else.
And all the OEMS are largely locked into Intel roadmaps, so even though AMD integrated GPUs are awesome, it's very difficult to find a laptop that ships with it. A laptop that could be an ultrabook all day, but still game at console-level graphics (1080p 30fps on low settings) would be a huge hit, and the product already exists. Most companies just aren't selling any of it yet (The Dell XPS 9575 being the only widely-available Windows unit I'm aware of).
So, to buy an AMD-powered laptop, you basically have to buy weird low-end hard-to-find laptops from HP or whatever -- even though the market that would best be served by these devices is the mid-range to high-end (ThinkPads, XPS, Spectre, etc).
APU with Radeon HD 6290, good DirectX 11 support, rebooted driver on GNU/Linux with loss of features from fglrx.
This is an MI50 with soldered display outputs, since the MI cards lack them, as they have no rasterization support in the driver.
My bad! :)
"With a $5000> card (MI50) AMD is forced to sell at a loss to be able to meet the same performance levels NVIDIA had for the same amount of money ($700) 2 years ago."
Source? What 2-year-old nvidia card is beating the Instinct MI50 at floating-point math? Does it even support FP64? Nothing about that sounds even a bit true.
"The 2080 for all intents and purposes is a mid range die."
The second-fastest GPU in the world is 'mid-range'? Yeah right. Not even close, unless you only consult millionaires for your part info.
"AMD had to drop the prices of VEGA to around $300 due to the RTX 2060"
Vega 64 competes with the 2070, it blows the 2060 out of the water. Also the price has gone down? Yeah, that's what happens when tech gets older. Prices drop. They always have and always will.
Just curious, but how did you come up with that estimate?
We're about 1 month away from NVIDIA's next quarterly report. It will be really interesting to see how much revenue they've gotten so far from Turing, especially considering the 2060 was released only 1 month prior to the date their revenue will be reported on.
Yeah, there are higher-end XCC Intel chips or higher-end NVidia chips. But by any practical measurement, anything pushing 500mm^2 or bigger is downright massive.
Radeon VII is ~350mm^2, but on a very expensive 7nm process. Hard to tell what "high end" or "mid range" will be at 7nm right now, since there aren't too many chips being made to compare against.
It doesn't change the fact that the TU104 is a mid-range die as far as the Turing lineup goes; it doesn't matter if it's 500mm^2 or 5m^2.
Radeon 7 is a fairly large die for its process, on a very expensive node, using the fastest memory around, yet it barely beats the 1080 Ti and brings next to nothing new to the table over Vega. Heck, it doesn't bring that many new things over Fiji, but at least they finally have dot-product support.
The simple reality is that for gaming the 2080 is still a better pick if only because of the RTX features.
For compute, the 2080 might edge it out depending on the workload, especially for DL/ML, due to the craptastic state of AMD's ecosystem even today.
For HPC, as in FP64, the Radeon 7 is worse perf/$ than the Titan V; you can buy a Titan V 24GB for £2400 today and get 7 TFLOPS of FP64.
The only case where the Radeon 7 would be better than the 2080 is likely in a small subset of GPU accelerated software that properly supports AMD GPUs e.g. blender.
I think there's a strong argument for future-proofing 4K gaming experiences. I think the 1080 Ti is a better pick, but the 1080 Ti is sold out in the US market. The 2080's 8GB of VRAM is usable today, but I'm not entirely sure how long it will be before 4K games blow through that.
The 1080 Ti has 11GB of VRAM; the Radeon 7 at 16GB is a bit overkill. But 8GB is solidly a mid-range VRAM size.
> The only case where the Radeon 7 would be better than the 2080 is likely in a small subset of GPU accelerated software that properly supports AMD GPUs e.g. blender.
FP32 performance is also used in Adobe Premiere (video editing). But yo man, 3D renders (like Blender) can take hours under normal circumstances. It's a solid market to try to accelerate.
And 8GB is more than enough. Heck, if AMD's memory management worked, as in devs would have used it, then 4GB would be enough for 4K even today; there are plenty of GDC talks about it, and about how some games on average only accessed 3GB of memory for a given level. The biggest ones were from Doom and Witcher 3, IIRC.
This is how the Xbox One X can do 4K (even if it's at 30fps) with only 12GB of RAM shared between the GPU and CPU, and how even the PS4 Pro can do 4K with 8GB. In all cases, when they can't, it's not a memory limitation but rather a fill rate / horsepower issue.
I'll take NVIDIA's stellar memory management, compression, and tiled rendering over AMD's 1TB/s of memory bandwidth and 16GB of memory for 4K gaming. And 4K gaming on PCs is, well, meh; I'm actually regretting getting a 4K monitor and a 1080 Ti SLI setup, something I will remediate once the 1440p FALD HDR G-Sync monitors come to market.
You're right about Adobe being bad with OpenCL, but it's just an example. My overall point is that the 2D video editing community is mostly about FP32. A high-end FP32 card caters to 3D renderers, 2D video editors, and others in the professional marketplace.
Davinci Resolve users seem to get a bigger benefit out of Vega cards than 10xx series NVidia cards. So it depends on your video editor.
Aside from scientific compute, I'm having trouble figuring out where FP64 support matters. All of the applications I'm aware of are FP32. There's a clear difference in the rendering of heavy effects (optical-flow-based resampling) or 3D editors, so a good GPU really makes a difference in those cases.
> And 8GB is more than enough. Heck, if AMD's memory management worked, as in devs would have used it, then 4GB would be enough for 4K even today; there are plenty of GDC talks about it, and about how some games on average only accessed 3GB of memory for a given level. The biggest ones were from Doom and Witcher 3, IIRC.
XBox and PS4 Pro are bad examples to use. Those systems have far weaker GPUs, and the textures aren't truly 4k.
If you enter a game with 4k texture packs (or 8k texture packs, like Star Citizen), you can expect your memory usage to spike significantly. 8GB is still sufficient, but gamers are expecting better and better textures.
4k + 10-bit HDR textures take up a lot of room.
Textures are the same as very high/high on PC; the only thing that some PC games add is an uncompressed texture pack, which is simply stupid to begin with.
Xbox One X is a great example of a system that can push 4K or near 4K gaming because it does so pretty darn well.
The memory capacity is really not an issue at 8GB; in fact, the only cases where the 2080 really pulls ahead of the 1080 Ti are at higher resolutions, especially 4K.
>if you enter a game with 4k texture packs (or 8k texture packs, like Star Citizen), you can expect your memory usage to spike significantly. 8GB is still sufficient, but gamers are expecting better and better textures.
Most textures in a game are well beyond 4K or even 8K regardless of the display resolution; heck, a simple texture atlas is huge.
Star Citizen is a crap game to use as an example, as the textures are overall simply horrid and it makes PUBG look like a well-optimized title.
Overall, the 2080 gives you RTX, which is RT plus Tensor Cores. That will become very important as WinML/DirectML games begin to ship, since NVIDIA both assisted Microsoft in defining the spec and it's essentially tailored for their Tensor Cores. You also get concurrent int/FP execution per ALU, which will give NVIDIA a huge boost as integer shaders (e.g. compute shaders with heavy bool logic) become more common, plus a more mature ecosystem, better encode/decode, and various other small QoL improvements.
For gaming specifically the Radeon 7 doesn't have anything to offer other than to those that care more about the brand than what they get for their investment.
I really miss the days of the HD 4000 series, or the days of the R300, and sadly I don't see AMD getting out of this cycle any time soon: Fiji was late and underwhelming, Vega was late and underwhelming, and now Radeon 7 is late and underwhelming.
If Navi is another mid-range GPU, we'll see a 15%-20% improvement over Polaris if we're lucky. Once we get closer to Feb 7th, I'm pretty sure we'll see that at 1080p there is likely no difference between Vega 64 and Radeon 7, and the only place we'll see a difference is at 4K, due to slightly higher fill rate and more bandwidth, since AMD still can't get delta compression to work correctly.
It's "enough" because it has to be. Game developers need to constantly pay attention to these budgets because people who play their games have hardware with limited performance. If everyone had more of a resource then none of them would say no to that particularly now that GPU compute is becoming more popular in games.
Doom had a similar talk, resource management is a key factor in game development.
GPU compute is a completely different issue.
4k isn't that impressive for gaming with a monitor. Most people can barely tell the difference between it and 1440p, but there's a huge performance difference. For VR it does matter though.
AMD has clearly put a lot of R&D into that approach, and it would allow Ryzen 3000 etc. to share a package with a Navi GPU die.
It would also explain the mysteriously absent second 7nm die in the Ryzen announcement this past CES.
CPUs have a long-history of NUMA / multiple sockets. GPUs history of "Crossfire" and other such technologies is far weaker and has far worse support. It is less clear how software is supposed to represent asymmetric access to memory.
How do you build a GPU with a single view of memory, when each die is likely connected to its own set of RAM? Especially with latency being as big of an issue as it is already.
I'm sure AMD is working on the problem, but I don't personally expect it to be solved so soon. Yeah, AMD Infinity Fabric is a nifty 50GB/s chiplet bus. But HBM2 is 1TBps, so Infinity Fabric is simply too slow to share data between hypothetical GPU chiplets.
The EPYC Zen 2 cores have increased L3 and then communicate to RAM via the monolithic I/O core. It's very different than traditional NUMA in terms of locality but it's also likely that latencies have been increased.
A GPU with the same design would look monolithic as far as software is concerned since modern GPUs are already subdivided into hundreds of compute cores as it is.
As stated earlier: the Infinity Fabric links on EPYC (not necessarily EPYC 2... but on original EPYC 1) were like 50GB/s. Assuming EPYC2 has a similar fabric, we're still only looking at technology that can push 50GB/s to any core.
GPUs like the Vega 64 have 500GB/s of bandwidth, and the Radeon 7 has 1000GB/s of bandwidth to main memory. Not cache: literally its main memory bandwidth. It's completely, unfathomably huge compared to CPU architecture.
Building a 1000GB/s crossbar between RAM and GPU compute units doesn't seem like an easy problem to me. 50GB/s Infinity fabric was done before (socket-to-socket communication), AMD just shrank the tech down to fit on chiplets.
InfinityFabric really has nothing to do with this, especially when locality has such a strong impact on possible bus-widths.
HBM is the topology, btw. Each 1024-bit connection (1024 pins!) gives 250GB/s between the chip and RAM; four connections to RAM from one chip gets you 1000GB/s.
With a chiplet design, you'd need 2 or more chips connected to RAM. The immediate design, one chip on two memory controllers and the other chip on the other two (500GB/s per chip), now halves the bandwidth.
So now you have to somehow allow the two chips to communicate at 500GB/s, bidirectionally btw (500GB/s from chip0 to chip1, and 500GB/s from chip1 to chip0). Even then, this bidirectional link will have horrible latency, since a remote access goes Chip0 -> Chip1 -> HBM2 stack.
So it just raises questions how anything would work at all. It just doesn't seem like an easy problem to solve to me. But hey, maybe I'm wrong and AMD got it figured out. All I'm saying is: no one has demonstrated a chip-to-chip bus that keeps up with HBM2 speeds.
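To put rough numbers on the asymmetry being described (a sketch using the figures quoted in this thread, which are assumptions, not measurements):

```c
/* Back-of-envelope bandwidth model for a hypothetical two-chiplet GPU,
 * using the (assumed) figures from this thread: 250 GB/s per 1024-bit
 * HBM2 stack, ~50 GB/s for an EPYC 1-era Infinity Fabric-class link. */

double local_bw_gbs(int hbm_stacks) {
    return hbm_stacks * 250.0;   /* bandwidth to a chip's own stacks */
}

double remote_bw_gbs(void) {
    return 50.0;   /* capped by the inter-chiplet link */
}

/* Monolithic die, 4 stacks: local_bw_gbs(4) = 1000 GB/s.
 * Two chiplets, 2 stacks each: 500 GB/s to local memory, but only
 * ~50 GB/s to the other chiplet's memory, a 10x asymmetry. */
```

The point of the toy model: the local/remote gap is an order of magnitude, which is exactly the kind of asymmetry GPU software has never had to express.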
So the idea that bandwidth "halves" with any intermediate step is clearly wrong since the bandwidth available to any compute core is already a fraction of the total available.
Yes and no. In a product marketed toward consumers, the price tiers have to be at least somewhat informed by what people are willing/able to pay. A Boxster isn't an entry-level car just because it's the cheapest one Porsche makes.
I also feel like graphics cards are one of those weird things that command a lot of brand loyalty, so it's probably going to take more than near-performance-parity to move the needle.
Because AMD is a minority and game developers know it, driver-related bugs that would be priority 1 to fix if it affected nVidia users are bottom priority for AMD users. The AMD midrange cards are wonderful in terms of the hardware for certain displays and competitive in every way from a hardware perspective. Especially if you are looking for 60fps+ @ 1080p you can get awesome deals by going with AMD.
Just be prepared to occasionally have issues like BSODing whenever a game renders a lot of white textures like snow scenes for six months at a time unless you downgrade the drivers to a specific version, which will then be fixed in a driver update and broken again if you are an AMD user. Or to occasionally have driver crashes that developers acknowledge and don't fix for months. And to fuss with shader cache settings and things like that because those settings are busted for AMD for some specific game you want to play.
It's not even necessarily that the drivers suck. They don't suck. They're fine. It's that the developers don't have a big incentive to fix bugs that only impact a smaller segment of the market. While it's not like there aren't a lot of nVidia only bugs that crop up, because of its market share, those bugs are almost always a higher priority to fix and you really notice it if you have used both GPU manufacturers at the same time over a period of years or have used both at various times.
Most of the nVidia tech gimmicks usually suck and are uninteresting (HairWorks, physx, realtime ray tracing) and those that don't suck are usually matched quickly (Gsync). But the marketshare alone is sort of a perversely positive feature because it just means you are on the same drivers as the majority of the market so bugs that impact you are a higher priority for developers to fix.
This isn't even a gimmick though. This is the future, because our current way of doing shadows and lights will keep increasing in complexity, but will never be good enough. Whether nvidia's solution will work is a different matter, but this most certainly is not a gimmick.
Over the next 3 years and the next generation of cards, absolutely, great technology.
I can even link you to a video from a YouTuber sponsored by nVidia that essentially says it is a nice-looking gimmick: nice for cinematic single-player games, but not that useful for fast-paced multiplayer games, where most gamers will pick performance over image quality most of the time. DLSS, because it is a performance feature, is in a different category. Like HairWorks, RTX is a tech-demo-type feature that most users will turn off to get dozens more FPS.
The thing is, I'm not that picky when it comes to graphics. I've seen too many games with wonderful graphics that were just plain boring.
A nice optimum today is a monitor with something like 2560x1440 and up to 144Hz adaptive sync. And you don't need Vega 20 with crazy-expensive 16GB of VRAM for that. Hopefully Navi will fit that use case well and will have fewer availability issues than the Vega 10 (56/64) cards.
I've been using the same RX 480, and it produces ~40-50 fps in TW3 at 1920x1200 on max settings (no HairWorks) in Wine+dxvk on Linux. Something like a Vega 56 already hits 60-80 fps in the same setup. 4K would be way too heavy for anything.
Also, as the above video suggests, better antialiasing obscures resolution issues. Sure, if you don't use it, you need more resolution to compensate.
> max settings
I don't remember presets in TW3, but I bet there's MSAA and various heavy shader effects enabled.
Unlike TW2, it's not using any super-crazy double-pass anti-aliasing (which they called "ubersampling"), so it's reasonable on max, even if demanding. But having that and 4K is already too much for any single GPU :)
> up to 144Hz adaptive sync
you are agreeing with him...
In my personal experience, I need every watt of performance out of the 2080 Ti to achieve a consistent 180-240Hz.
Not really. At 2560x1440 you don't need that much VRAM (and the VRAM is what makes the Radeon 7 so expensive). So it is overkill and not very practical for gaming. It's probably a really good 3D rendering card, though.
> I need every watt of performance out of the 2080 Ti to achieve a consistent 180-240Hz
I think 144Hz is more than enough. More would be an overkill as well. And Vega 64 and upcoming Navi should handle 60-144 fps well at 2560x1440.
However, your experience is far from an objective characterization of competitive gaming. 60 FPS spikes are called "cancer" in such circles.
It's overkill, because most GPUs don't have anywhere near that amount of memory and performance, but if they did then games would use up all of it.
4K gaming might be a red herring on small desk monitors, but more and more of us are gaming on large living room TVs with PCs plugged in next to consoles.
If you are OK with low framerate, then sure, 4K can be appealing. But higher framerate improvements are generally quite noticeable, so console makers try to downplay them.
I'm certainly interested in AMD's Linux "ROCm" push. I really think the programming model there is relatively easy to understand, but there are major flaws in the documentation and implementation.
For example, OpenCL 1.2 on ROCm 2.0 isn't stable enough to run Blender Cycles. Yes, you can render the default cube, but very slowly. On a real scene, Blender Cycles on OpenCL ROCm can take 500+ seconds to compile, and the actual execution seems to hang (infinite loop and/or memory segfault, depending on the scene) on anything close to a typical geometry.
Note that Blender's OpenCL code is explicitly written for AMD's older OpenCL (AMDGPU-Pro OpenCL implementation). Blender has a separate CUDA branch for NVidia cards. So OpenCL ROCm is at very least performance-incompatible with OpenCL AMDGPU-Pro. The Blender OpenCL code probably has to be rewritten to work (ie: not infinite loop), and maybe even become efficient on OpenCL ROCm again.
AMD's hardware is fine (not as power-efficient as NVidia, but performance is fine, in theory). But the drivers / software stack is clearly immature. Even as ROCm has hit a 2.0 release, these sorts of issues still exist.
AMDGPU-PRO with OpenCL 1.2 is workable, but feels old and cranky. (OpenCL 1.2 was specified in 2011 and is missing key features: its atomics model is incompatible with C/C++11, it's missing SVM and kernel-side enqueue, etc.)
AMDGPU-PRO OpenCL 2.0 is theoretically supported, but is still unstable in my experience. ROCm OpenCL (both 1.2 and 2.0) is still under development, and doesn't seem to be ready for prime time yet. (At least, if Blender 2.79 or 2.80 Cycles is any indication.)
AMD HCC seems usable, but there aren't many programs using it. AMD HIP is an interesting idea but I haven't used it.
I know NVidia has driver/software issues too. But CUDA code written 5 years ago will still have similar performance and behavior if run on today's cards, on today's software stack. I'm not sure the same is true for AMD's OpenCL code (between AMDGPU-PRO OpenCL 1.2 and ROCm OpenCL 1.2).
Long story short: the only mature AMD OpenCL compute platform seems to be OpenCL 1.2 on AMDGPU-PRO. Fortunately, it also seems like AMDGPU-PRO will work for the foreseeable future, but AMD really needs to clarify its platform to attract developers. (Ex: prioritize testing of ROCm OpenCL to ensure performance-compatibility with existing OpenCL 1.2 code written for AMDGPU-PRO)
Which is partially why ROCm exists: it's the open-source implementation of AMD's driver stack. AMD seems to have indicated that the open-source ROCm drivers are the way of the future. It's a sentiment I can certainly get behind (and AMD has even pushed the ROCm driver stack into the Linux kernel proper).
But ROCm isn't ready quite yet. So I think in practice, people will still be relying upon the older AMDGPU-Pro drivers. At least for the next year.
Nvidia made no pretense, you downloaded a thin open source shim, typed "make" and then it worked and was fast. And 99.9% proprietary blob.
not my experience
Edit: added timestamp to video
Essentially this type of thing is the only good way to improve realism. And it allows rendering to move eventually from what is basically an aggregation of different types of lighting approximations towards a more unified global illumination system. It is going to require engine rewrites and a few more generations of hardware improvements to fully achieve that, but it will enable real-time rendering close to today's cinema quality and also streamline game development by removing the need for so much asset preparation related to lighting.
Anyway I think that AMD absolutely needs to catch up in this area. Even though many people related to gaming or programming may have trouble recognizing the relevance because all of the lighting hacks are firmly baked into the culture and accepted. All those hacks will gradually become obsolete.
In fact, AMD has some tricks up its sleeve. AMD Zen has two AES pipelines (while Intel Skylake only has one), so vectorized AES code is faster on AMD.
There are some performance issues with vpgatherdd instructions, but Intel has those as uop emulated code too. So both Intel and AMD are equally to blame there.
My "issues" with AMD CPUs are relatively tame. AMD's profiling tools are weaker than Intel's. (Ex: AMD has "Instruction Based Sampling", while Intel's PEBS (Precise Event-Based Sampling) is a bit easier to use.) True 256-bit execution would be nice, but it's not a major hassle in most cases IMO: AMD's back-end is very wide, so you can still get a lot of ILP out of AVX2 instructions.
Isn't HPET just a basic system timer? (That is definitely present on Zen.) You might be thinking of something else?
> 256-bit execution would be nice, but its not a major hassle in most cases
Coming with Zen 2 anyway :)
You're right. I got the names confused. I've edited the post above to use the proper "PEBS" term.
Intel's "Precise Event-Based Sampling" is what I was trying to talk about. Intel's PEBS can precisely tell you where a branch-mispredict happens.
AMD's default event counters are inaccurate: your branch mispredictions will be attributed all over "add" instructions and other unrelated stuff. This is because the CPU has roughly ~100 instructions in flight at once (between the pipeline, decoder, and retirement, there's a lot of inaccuracy in determining "where did this branch misprediction happen?").
So when trying to track down a branch misprediction on AMD systems, you have to switch to the harder-to-use "Instruction Based Sampling" mode. Intel's PEBS switch is simpler and easier to use IMO.
This slow scaling of 128->256->512 bits in the instruction set is more or less a solved problem in the GPU space with shader compilers, and AVX would mostly be redundant with GPUs if it weren't for the memory bandwidth constraint.
ie: When it comes to vector processing, go big or go home.
AVX/SSE are a compromise from back when CPU die space and bandwidth were more precious. Now that we have 8-32 cores on a die with a good bus between them, duplicating those AVX units 8-32 times seems less optimal.
AVX's main advantage is that it is roughly 1-cycle away from your main registers, and 4-cycles away from L1 cache. Talking to and from L3 cache is on the order of 30 to 40 cycles, an order of magnitude slower.
If your workload fits inside of 64kB, AVX is incredibly beneficial. If your workload fits within 8MB (L3 cache), you're starting to look at a point where maybe you should pipe that data to the GPU instead.
A GPU call over PCIe is under 5 us (5000 nanoseconds), with a bandwidth of ~15GB/s. GPUs are certainly farther away than L3 cache, but if you're pushing L3-cache-sized workloads, you're getting close to GPU territory anyway.
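Those figures are easy to turn into a break-even model (the ~5 us overhead and ~15 GB/s link speed are the numbers quoted above, not measurements):

```c
/* Time to ship a buffer to a GPU over PCIe, using the (assumed) figures
 * from this thread: ~5 us call overhead, ~15 GB/s of link bandwidth. */
double gpu_transfer_us(double bytes) {
    const double launch_us = 5.0;
    const double bytes_per_us = 15e9 / 1e6;   /* 15 GB/s = 15000 bytes/us */
    return launch_us + bytes / bytes_per_us;
}

/* gpu_transfer_us(8.0 * 1024 * 1024) ~= 564 us for an L3-sized 8 MB
 * buffer: far above L3 latency, but easily amortized if the GPU-side
 * work on that buffer is big enough. */
```

For tiny buffers the fixed 5 us launch overhead dominates, which is exactly why small, latency-sensitive workloads stay on the CPU's vector units.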
128-bit SSE code is perfect for representing Complex Numbers (Two double-floats). 128-bit is also great for a x,y,z,w 32-bit vector.
GPUs are the "go big or go home" architecture. AVX's primary benefit is latency.
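As an illustration of that 128-bit sweet spot, a complex multiply maps directly onto one SSE register. This is a standard SSE2 idiom, not anything specific to the hardware discussed here:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Multiply two complex doubles, each packed as [real, imag] in one
 * 128-bit register: (a+bi)(c+di) = (ac - bd) + (ad + bc)i.
 * With SSE3, the last two ops collapse into a single _mm_addsub_pd. */
__m128d cmul(__m128d x, __m128d y) {
    __m128d yr   = _mm_unpacklo_pd(y, y);    /* [c, c] */
    __m128d yi   = _mm_unpackhi_pd(y, y);    /* [d, d] */
    __m128d t1   = _mm_mul_pd(x, yr);        /* [a*c, b*c] */
    __m128d xs   = _mm_shuffle_pd(x, x, 1);  /* [b, a] */
    __m128d t2   = _mm_mul_pd(xs, yi);       /* [b*d, a*d] */
    __m128d sign = _mm_set_pd(1.0, -1.0);    /* negate only the low lane */
    return _mm_add_pd(t1, _mm_mul_pd(t2, sign)); /* [ac-bd, bc+ad] */
}
```

For example, `cmul(_mm_set_pd(2.0, 1.0), _mm_set_pd(4.0, 3.0))` computes (1+2i)(3+4i) = -5+10i (note `_mm_set_pd` takes its arguments high-lane first). The whole operation stays in registers, which is the latency advantage being described.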
Here's the simplest godbolt example I could think of to illustrate, summing a string of fixed length and non-fixed length:
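(The godbolt link itself isn't preserved in this thread; a minimal reconstruction of the kind of code being described, with names of my own choosing, would be something like the following.)

```c
#include <stddef.h>

/* Sum the bytes of a buffer into an int: once with a compile-time
 * length, once with a runtime length, to compare what the
 * auto-vectorizer emits for each (e.g. gcc/clang at -O3 with
 * AVX-512 enabled). */
int sum_fixed(const char *s) {
    int sum = 0;
    for (size_t i = 0; i < 4096; i++)
        sum += s[i];
    return sum;
}

int sum_var(const char *s, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += s[i];
    return sum;
}
```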
You can see that the most recent GCC fails to use the AVX512 zmm registers even after being configured to do so (afaik) and also fails to use more than 4 registers. Clang does better, using zmm* and all the registers.
But in both cases, the amount of code generated is quite large. If you compile with -Os instead of -O3, no vector instructions are used for some reason.
So when you load this code, no matter what, you're loading a bunch of instruction cachelines, which will destroy most of your latency gain unless the input is very large. And even if your input is large, you'll fault that data anyways.
So what's the point of doing this on the CPU again?
Changing the code to 32-bit ints results in the "vpaddd zmm0, zmm0, ZMMWORD PTR [rdx]" that you'd expect from the auto-vectorizer.
> You can see that the most recent GCC fails to use the AVX512 zmm registers even after being configured to do so (afaik) and also fails to use more than 4 registers
In the 32-bit vpaddd code... "vpaddd zmm0, zmm0, ZMMWORD PTR [rdx]" becomes a new zmm register allocation in the register-renamer (due to a cut dependency). I doubt any code would be any faster.
In any case, it's not about "the number of registers used"; proper analysis is about the depth of dependency chains. I'd only be worried about small register usage if the dependency chain were long (which doesn't seem to be the case).
EDIT: I had some bad analysis. I've erased the bad paragraph.
So it seems pretty good in my eyes.
So already it's better than "pretty good" and it's still verbose.
But neither of these compilers are able to optimize for code size while using AVX, so the ((code size)/64byte cacheline) * (~100ns loads) will still kill your performance on any data set that's smaller than a few kilobytes.
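That penalty is easy to put a very rough number on. The ~100 ns per cold cache line is the figure from this thread, not a measured constant, and the model deliberately ignores prefetching:

```c
/* Crude i-cache cost model for bulky auto-vectorized code: each cold
 * 64-byte line of instructions costs on the order of one ~100 ns
 * memory load before it can execute. Treat this as an upper-bound
 * sketch, since hardware prefetch hides some of it. */
double cold_code_cost_ns(double code_bytes) {
    const double line_bytes = 64.0;
    const double load_ns = 100.0;
    return (code_bytes / line_bytes) * load_ns;
}

/* 1 KB of unrolled vector loop: cold_code_cost_ns(1024) = 1600 ns,
 * which dwarfs the time to actually sum a small buffer. */
```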
You specified "int sum", so "sum" needs to follow 32-bit overflow semantics. I don't think it is possible to do the accumulation in 8-bit lanes without changing the result.
> But neither of these compilers are able to optimize for code size while using AVX
You can totally do that. You're specifying -O3, which is "speed over size". If you want size, you want -Os, and then use -ftree-vectorize to enable the auto-vectorization.
The results are often counterintuitive, because it's hard for a human to account for how the program _actually_ gets executed, which instructions go in which order, port utilization, data dependencies, CPU bugs, effects of alignment (or, often, lack thereof on Intel CPUs), micro-op caching, etc etc. For all you know GCC deliberately avoided the use of AVX512 registers because AVX512 causes the CPU to throttle the clock.
I've also found that ye olde -O2 recommendation (which says -O3 is likely to be slower in practice because it produces bulkier machine code) nearly never holds up anymore.
That said, I concede that autovectorization is not perfect. I can almost always beat it if I really want to, sometimes by quite a margin. But what you're proposing is unlikely to help matters, especially for smaller inputs that you're concerned about.
You are welcome to load up the godbolt example and try to create better assembly using -O2 or whatever else you can think of. You can build the code on your machine and benchmark it too.
But at the end of the day, the compiler output sucks. It's clearly optimized for synthetic benchmarks that chug through large datasets using a tiny amount of code that can fit in L1 cache even after the compiler blows up its footprint.
But my point is that these large datasets with tiny processing kernels (like a video codec, for example) are much better suited to different kinds of processors.