CUDA is the only reason I have an Nvidia card, but if more projects start migrating to a more agnostic environment, I'll be really grateful.
Running Nvidia in Linux isn't as much fun. Fedora and Debian can be incredibly reliable systems, but when you add an Nvidia card, I feel like I am back in Windows Vista with kernel crashes from time to time.
My Arch system would occasionally boot to a black screen. When this happened, no amount of tinkering could get it back. I had to reinstall the whole OS.
It turned out to be a conflict between the nvidia drivers and my (10-year-old) Intel integrated GPU. Since switching to an AMD card, everything has worked flawlessly.
Ubuntu based systems barely worked at all. Incredibly unstable and would occasionally corrupt the output and barf colors and fragments of the desktop all over my screens.
AMD on arch has been an absolute delight. It just. Works. It's more stable than nvidia on windows.
For a lot of reasons-- but mainly Linux drivers-- I've totally sworn off nvidia cards. AMD just works better for me.
As a counter-argument, I ran Arch Linux + nvidia GPUs + Intel CPUs between 2012 and 2020, and still run Arch + nvidia (now with AMD CPU) to this day. I won't say it has been bug free at all, but it generally works pretty well. If you find a problem in Arch that you cannot fix without reinstalling, you do not sufficiently understand the problem or Arch itself. "Installing" Arch is refreshingly manual and "simple" compared to the magic that is other Linux distros or the closed source OSes.
I tried using an Nvidia card with OBS to record my screen, and it would kind of freeze in Wine. I switched from X11 to Wayland, and now Wine shows horizontal lines (!) and performs like crap.
Even my 4GB RX 570 from years ago gives a better experience doing this. You just install OBS from flathub, Wayland works, everything works without any setup or tinkering. You click record and you can record your gameplay footage.
I'm sure that I could have fixed it, but I gave up after spending multiple evenings on it. Have you ever spent hours debugging a system exclusively in text mode? It isn't fun. Reinstalling the OS takes less than 30 minutes. It's a clear choice for me.
Yes, in fact, I have spent hours debugging a system from the console. links/lynx is a godsend. I agree, though, that reinstalling is certainly easier. This is more of a philosophical argument than a practical one. I installed Arch to really learn Linux, not just to get work done. If I just wanted to get work done, I'd have used Fedora, Ubuntu, or Debian.
I ran a laptop with a switchable dedicated Nvidia GPU and integrated Intel GPU for a decade with no issues. I used to use something called Bumblebee to swap between them depending on workload, which actually worked surprisingly well given the circumstances. Eventually I just dropped back to integrated only when I stopped doing anything intensive with the machine.
I run Arch as well and AMD is only "good". I would have a problem every now and then where my RX560 would lose its mind coming out of sleep and I'd have to reboot.
But the other problem that really bugs me is the "AMD reset bug" that you trip over with most AMD GPUs. This is when you pass a second GPU through to another OS running under KVM, which is what lets you run Linux and (say) Windows simultaneously with full GPU hardware acceleration on the guest. The reset bug means the GPU will hang upon shutdown of the guest, and only a reboot will let you recover the card. This is a silicon-level bug that has existed for many years across many generations of cards, and AMD can't be arsed to fix it. Projects like "vendor-reset" help for some cards, but gnif2 has basically given up (he mentioned he even personally raised the issue with Lisa Su). Even AMD's latest cards like the 7800 XT are affected. NVidia works flawlessly here.
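For anyone who does want to try vendor-reset anyway, the usage is roughly this, going from memory of the project's README (the PCI address is a placeholder for your passed-through AMD card, and it only helps the GPU generations the module actually supports):

```
# vendor-reset is an out-of-tree DKMS module; load it, then tell the kernel
# to use its device-specific reset for the GPU being passed through.
sudo modprobe vendor-reset
echo device_specific | sudo tee /sys/bus/pci/devices/0000:0a:00.0/reset_method
```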
> CUDA is the only reason I have an Nvidia card, but if more projects start migrating to a more agnostic environment, I'll be really grateful.
What AMD really needs is 100% feature parity with CUDA without users having to change a single line of code. Maybe for that to happen it needs to add hardware features or something (I see people saying that CUDA as an API is very tailored to the capabilities of nvidia GPUs); I don't know.
If AMD relies on people changing their code to make it portable, it has already lost.
The idea was supposed to be that people convert cuda to hip, which is a pretty similar language, either by hand or by running a tool called 'hipify' that comes with rocm. You can then compile that unmodified for amdgpu or for nvptx.
I think where that idea goes wrong is that in order to compile it unmodified for nvptx, you need a toolchain that knows both hip and nvptx, which the cuda toolchain does not. Clang can mostly compile cuda successfully, but it's far less polished than the cuda toolchain. ROCm probably has the nvptx backend disabled, and even if it's built in, at best it'll work as well as upstream clang does.
What I'm told does work is keeping all the source as cuda and using hipify as part of a build process when using amdgpu - something like `cat foo.cu | hipify | clang -x hip -` - though I can't personally vouch for that working.
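For what it's worth, the two-step version of that flow with the tools that actually ship in rocm looks roughly like this (file name and gfx target are made up; hipify-perl is the simple text-based converter, hipify-clang the heavier one):

```
# One-time port: hipify-perl rewrites the cuda API calls and writes the
# translated HIP source to stdout.
hipify-perl foo.cu > foo.hip.cpp

# Build for an amdgpu target with the rocm compiler driver.
# (gfx1100 is a placeholder; use your card's gfx target.)
hipcc foo.hip.cpp --offload-arch=gfx1100 -o foo
```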
The original idea was people would write in opencl instead of cuda but that really didn't work out.
> I see people saying that CUDA as an API is very tailored to the capabilities of nvidia GPUs
I'm wondering how true that is, because it could give NVidia issues in the future if they need to redesign their GPUs should they hit some limit with the current designs. Dependence on certain instructions makes sense, but there's nothing technical preventing AMD from implementing those instructions, only legal mumbo jumbo.
That's a fun idea. Qemu parses a binary into something very like a compiler IR, optimises it a bit, then writes it out as a binary for the same or another target in a JIT-like fashion, so that sort of thing can be built. Apple's Rosetta is functionally similar; I expect it does the same sort of thing under the hood. Valgrind is another tool built on the same architecture.
It would be a painful reverse engineering process - the cuda file format is sort of like elf, but with undocumented bonus constraints, and you'd have to reverse the instruction encoding to get sass, which isn't documented, or try to take it directly to ptx which is somewhat documented, and then convert that onward.
It would be far more difficult than compiling cuda source directly. I'm not sure anyone would pay for a cuda->amdgpu conversion tool, and it's hard to imagine AMD making one as part of ROCm.
Why would I blame NVIDIA? If it weren't for them, we'd still only have needlessly cumbersome APIs and ecosystems. They did what Khronos always failed to do: they created something that is easy, powerful, and fast. Khronos always heavily neglects the easy part.
How are they preventing the competition from creating something better than CUDA? And how does it hurt consumers that they are providing a fantastic product that others refuse to provide?
I see these complaints from time to time and I never understand them.
I've literally been running nvidia on linux since the TNT2 days and have _never_ had this sort of issue. That's across many drivers and many cards over many, many years.
Describing kernel panics and general nightmare scenarios as the general course with Nvidia doesn’t make sense either.
Nvidia has 80% market share of the discrete GPU desktop market and at least 90% market share of cloud/datacenter.
Nvidia GPUs are used almost exclusively for every cloud powered AI service and to train virtually every ML model in existence. Almost always on Linux.
Do you really think any of this would be possible if what you are describing was anything approaching the typical experience starting at the /driver/ level?
Nvidia would have never achieved their market dominance nor held on to it this long if the issues you’ve experienced impacted anything approaching a statistically significant number of users or applications.
Nvidia gets a lot of hate on HN and elsewhere (much of it fair), but I will never understand the people who claim it doesn't work, when it clearly gets the job done (often very well).
People use flaky software all the time. As long as it mostly works most of the time, most people put up with it. Examples: Windows in the 90s and 00s, or any AAA game on first release in the last 10 years.
I have a friend at the Facebook AI Research lab and I assure you they would not tolerate any level of fundamental flakiness from their 8,000 GPU cluster. Talent, opportunity cost, and time to market in general are so crucial in AI that no one has any time or patience for the "oddball Linux desktop" experiences people are describing here.
Gaming users may tolerate some flakiness for their hobby but these AI companies dealing in the nine-figure range (minimum) absolutely do not.
My guess is when FB does run into such flakiness they email ____.____@nvidia.com as part of some support contract they have and go "Yo, we see this issue, figure it out and fix it".
But I can promise you, after reading things like the LKML for decades and a number of different Microsoft blogs, that everyone on this planet experiences flakiness at times and has to figure out how to adjust their workload to avoid it until the issue is discovered and fixed.
He has described to me, in detail, some of the challenges they have had. I'm not saying it's exhaustive but I'm pretty sure if their experience with the fundamental software stack was what people here are claiming I would never hear the end of it.
Actually, no. Obviously they have Nvidia support but in one especially obscure issue he was describing Meta took it as an internal challenge and put three teams on it in competition. Naturally his team won (of course) ;).
Of course all software has flakiness - I'm not taking the ridiculous position that Nvidia is the first company in history to deliver perfect anything.
What I am saying is these anecdotal reports (primarily from Linux desktop hobbyists/enthusiasts) of "It's broken, it doesn't work. Nvidia sucks because it locked up my patched kernel ABC with Wayland XYZ on my bleeding edge rolling release and blah blah blah" (or whatever) are extreme edge cases and in no way representative of 99% of the Nvidia customer base and use cases.
Show me anything (I don't care what it is) and I'll find someone who has a horror story about it. Nvidia gets a lot of heat from the Linux desktop situation over the years and some people clearly hold an irrational hatred and grudge.
Nvidia isn't perfect but it's very hard to argue they don't deliver generally working solutions - actually best of breed in their space as demonstrated by their overwhelmingly dominant market share I highlighted originally.
On the flip side, one of the reasons I'm loyal to nvidia is a combination of two things.
1. They supported linux when no one else did,
2. I've never experienced instability from their drivers, and as I mentioned before, I've been running their cards under linux since the TNT2 days.
That was my experience: Nvidia Optimus (which is what allows dynamic switching between the integrated and dedicated GPU in laptops) was completely broken (as in a black screen, not just crashes or other issues) for several years, and Nvidia didn't care to do anything about it.
Yeah, Optimus was a huge PITA. I remember fighting with workarounds like bumblebee and prime for years. Nvidia also dragged their feet on Wayland support for a few years (and simultaneously seemed intent on sabotaging Nouveau).
I tried bumblebee again recently, and it works shockingly well now. I have a thinkpad T530 from 2013 with an NVS5400m.
There is some strange issue with some games where they don't get full performance from the dGPU, though still more than the iGPU would give. I have to use optirun to get full performance.
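For reference, the optirun usage is roughly this (the game binary is hypothetical; glxgears just confirms which GPU is doing the rendering):

```
# Sanity check that the dGPU is actually the renderer:
optirun glxgears -info | grep -i renderer

# Launch a game through the dGPU:
optirun ./some_game

# For Steam, the per-game launch options can be set to:
#   optirun %command%
```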
It also has problems when the computer wakes from sleep. For whatever reason, hardware video decoding doesn't work after the machine has been in standby. That makes Steam in-home streaming crash on the client, but flipping to software decoding usually works fine.
The important part is that battery life is almost as good with bumblebee as it is with the dGPU turned off. No more fucking with Prime or rebooting into BIOS to turn the GPU back on.
I understand it, but I also haven't had any trouble since I figured out the right procedure for me on Fedora (which probably took some time, but it's been so long that I can't remember). Whenever I read about people having issues, it sounds like they are using a package installed via dnf for the driver, etc. I've always had issues with dkms and the like, so I just install the latest .run from nvidia's website whenever I have a kernel update. I made a one-line script that calls it with the silent option and the flags for signing the modules for secure boot, so I don't really think about it. No issues in a very long time, even with the wackiness of prime/optimus offloading on my old laptop.
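A sketch of what such a one-liner can look like, in case it's useful (the installer file and key paths are placeholders; --silent and the module-signing flags are standard options of the .run installer):

```
# Non-interactive reinstall of the driver after a kernel update, signing the
# kernel modules with the MOK key enrolled for secure boot.
sudo sh ./NVIDIA-Linux-x86_64-*.run --silent \
    --module-signing-secret-key=/path/to/MOK.priv \
    --module-signing-public-key=/path/to/MOK.der
```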
I have been using NVIDIA cards for their compute capabilities only, both personally and at work, for nearly a decade. I've had dozens and dozens of different issues involving the hardware, the drivers, integration with the rest of the OS, version compatibilities, ensuring my desktop environment doesn't try to use the NVIDIA cards, etc. etc.
Having said that - I (or rarely, other people) have almost always managed to work out those issues and get my systems to work. Not in all cases though.
I use a rolling distro (OpenSUSE Tumbleweed) and have had zero issues with my NVIDIA card despite it pulling the kernel and driver updates as they get released. The driver repo is maintained by NVIDIA itself, which is amazing.
I do all of those things with my 3070 and it works just fine. Most of them will depend on your DE's Wayland implementation.
I'm not here to disparage anyone experiencing issues, but my experience on the NixOS rolling-release channel has also been pretty boring. There was a time when my old 1050 Ti struggled, but the modern upstream drivers feel just as smooth as my Intel system does.
I often have issues booting to the installer or first boot after install with an NVidia GPU.
Pop!_OS, Fedora and OpenSUSE work out of the box. Those all default to Wayland, I believe. Debian/Ubuntu distros are a bad time. I think they're still on X11. It's ironic, because X11 is supposed to be the more stable display server.
I think they moved to Wayland on 23.04 or 23.10. I just recently installed both to try and get a 7800xt working with PyTorch and the default was Wayland.
Those problems might just be GNOME-related at this point. I've been daily-driving two different Nvidia cards for ~3 years now (1050 Ti then 3070 Ti) and Wayland has felt pretty stable for the past 12 months. The worst problem I had experienced in that time was Electron and Java apps drawing incorrectly in xWayland, but both of those are fixed upstream.
I'm definitely not against better hardware support for AI, but I think your problems are more GNOME's fault than Nvidia's. KDE's Wayland session is almost flawless on Nvidia nowadays.
I'm using KDE on Debian 12 with an AMD GPU on Wayland, and it works. It's still a bit more annoying than X11 with a few programs (Eclipse, DBeaver... I need to launch both with flags so they don't use the Wayland backend), but I can even play AAA games without problems.
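(For anyone hitting the same thing: Eclipse and DBeaver are GTK-based on Linux, so the usual workaround is forcing them onto XWayland via the GTK backend variable. A sketch, not necessarily the exact flags used above, and it assumes the launchers are on PATH:)

```
# Force GTK/SWT apps to use X11 (XWayland) instead of the native Wayland backend.
GDK_BACKEND=x11 eclipse
GDK_BACKEND=x11 dbeaver
```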
Nvidia on Linux is more like running Windows 95 from the gulag, and you're covered in ticks. I absolutely detest Nvidia because of the Linux hell they've created.
>> Yeah, nvidia linux support is meh, but still much better than amd.
Can not confirm. I used nvidia for years when it was the only option. Then used the nouveau driver on a well supported card because it worked well and eliminated hassle. Now I'm on AMD APU and it just works out of the box. YMMV of course. We do get reports of issues with AMD on specific driver versions, but I can't reproduce.
This week I upgraded the kernel on a 2017 workstation to 6.5.5, and when I rebooted and looked at 'dmesg' there were no fewer than 7 kernel faults with stack traces from amdgpu, just from booting up. This is a no-graphical-desktop system using a Radeon Pro W5500, which is 3.5 years old (I just had the card and needed something to plug in for it to POST).
I have come to accept that graphics card drivers and hardware stability ultimately comes down to whether or not ghosts have decided to haunt you.
Guess I'm also doing something wrong. Never had any serious issues with either Nvidia or AMD on Linux (and only a few annoyances on RDNA2 shortly after release)...
I never had an issue with nVidia drivers on Linux in the past 5 years, but I recently bought a laptop with a 4090 and an AMD CPU. Now I get random freezes, often right after I log into Cinnamon, but I can't really tell if it's the nVidia driver for the 4090, the AMDGPU driver for the integrated RDNA graphics, kernel 6.2, or a Cinnamon issue. The laptop just hangs and stops responding to the keyboard, so I can't log in to a console and check dmesg.
That might be a philosophical problem, but it never prevented me from training models on Linux. The half-baked, half-crashing AMD solutions just lead to wasting time I could spend on ML research instead.
In the closed-source days of fglrx or whatever it was called, I'd agree. Since they went open source, hard disagree. AMD graphics work in Linux about as well as Intel's always have.