Tensorflow 2.0.0-Alpha0 (github.com)
81 points by Gimpei 46 days ago | 38 comments



One thing I've always wondered is, why is TensorFlow NVIDIA-only? I was hoping a new major release would fix that, but it doesn't appear to have.

I've recently gotten an eGPU for my Macbook Pro for playing games in whatever little off-time I have on Windows. I also wanted to use it on Mac, though, and Macs only support AMD graphics. So I got a Vega 64. The other half of the reason was that I wanted to play with deep learning.

Turns out TensorFlow just does not work on AMD. There is a fork of it maintained by AMD, but they only support it on Linux, and the code in question reaches deep enough into the hardware layer that it cannot be run in a VM. It's also always at least a few versions behind.

With NVIDIA GPUs more expensive, and performing worse per dollar, than they have ever been, TF2 could have been the moment for a major improvement. Sad to see that is not the case.


Nvidia spends more money on marketing, developers, tooling, documentation, infrastructure, and publishing high-quality, usable libraries for GPU numerical computing. They have been doing this for years (CUDA is 10+ years old) and are very good at it, while HIP/HCC are just barely usable. In a nutshell, that is basically what it comes down to. I want to be clear that none of this is trivial in any sense; delivering a usable GPGPU software stack is extremely difficult work, and Nvidia is simply miles ahead of AMD in nearly every way, except for the fact that ROCm and the AMDKFD driver are going upstream into Linux. AMD is of course a smaller company than Nvidia, so their resources are more limited.

Here's the thing: in reality, it doesn't matter whether a Radeon VII or Vega 64 has better theoretical TFLOPs than a 1080 Ti or whatever if the card runs 50W hotter(!!!) and the software is bad enough that it runs at 1/2 the performance of the competitor. 50W hotter at a 10% perf loss and a 40% decrease in price is attractive. Hotter at 1/2 the perf is a non-starter (almost entirely ignoring price factors). Free drivers do not make up for this. DL is a fast-moving enough field that, for most practitioners, your time as an enthusiast/participant is better spent just paying the extra upcharge on an NVidia card and getting on with your life, so you can actually train models and do your work.

There is only one viable solution if you want to "do your work" today -- not for the better, and almost certainly for the worse. Ceterum censeo.

DL moves fast enough that it's always possible for new frameworks to come around, so perhaps a competitor designed from the ground up for multiple vendors can break through. But then you have to ask whether a lot of these frameworks can even succeed/tread water in such an environment without Google/Facebook propping them up, which is a separate problem... And before that, AMD has to make sure HIP/HCC work reliably, every time, everywhere, and that they have the libraries to back it up. I'm hoping they succeed at this.


AMD's software stack is definitely holding them back. But in the smaller tests I've done, the hardware really can get close to the theoretical TFLOPs / memory bandwidth numbers that they tout.

NVidia made the decision to hop on the Clang bandwagon years ago, and clearly that's allowed them to more easily optimize their code and move things forward. AMD finally has hopped on Clang with ROCm (but unfortunately still has some kinks to work out, esp with OpenCL)... and the compiler output seems to be superior as a result.

-------------

It seems unlikely you'll get a good value proposition out of AMD if you're going for deep learning. NVidia's libraries are simply superior at the moment, and they get superior performance as a result.

But when you look at lower-level software that is written for both GPUs, AMD's raw compute strength really shows. Cryptocoins are perhaps the best example, although not every workload looks like mining.

Luxray and Blender are more pragmatic examples: software raytracers that demonstrate AMD's incredible GPU strength and memory performance with HBM2. The AMD Vega chips can outperform NVidia's equivalents in these cases.

-----------

All in all, AMD's ROCm environment seems like it is ready for programmers who are willing to stick to the C (or HCC) level, coding their own libraries and maybe using a bit of inline assembly here or there to maximize performance.

CUDA has more libraries (cuBLAS for linear algebra, better Tensorflow support for deep learning, etc.) and that's definitely a major selling point, especially because of how well those libraries are written.

I guess what I'm trying to say is... AMD's hardware is fine. It's their software that's clearly holding them back.


AMD dropped the ball by not dedicating the sort of resources that Nvidia has to the deep learning space (and by playing the long, risky game of going all in on ROCm). Similarly, Apple dropped the ball by being so wedded to AMD.

Coincidentally, I was just digging through the Tensorflow code to find where it delegates to CUDA -- an enormous task given the huge volume of dependencies many layers deep -- because ultimately most neural nets entail a significant amount of simple math. It seems, and I say this in the nonsensical, overly confident way we all do when looking on from the outside, that it should be trivial to make it work with OpenCL, or even to maintain a branch for AMD.


'...it should be trivial...'

...but incredibly time-consuming. Developing bug-free, high-performance op implementations takes time and elbow grease. It's worth comparing the gradual progress of TF Lite (inference on ARM devices) and the TPU architectures: rolling out a full set of ops for a new platform is a multi-year project. When it's only partially done, you get inconsistent support and weird performance bottlenecks; so maybe a minimal MNIST model is fast, but many other workloads fall back to the CPU for lots of intermediate ops, killing performance.
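
For what it's worth, TF 2.x can log the device each op actually lands on, which makes those silent CPU fallbacks visible. A minimal sketch, assuming a GPU-enabled TensorFlow 2 install:

    import tensorflow as tf

    # Print the device every op executes on; fallbacks to CPU show up here.
    tf.debugging.set_log_device_placement(True)

    a = tf.random.uniform((1024, 1024))
    b = tf.random.uniform((1024, 1024))
    c = tf.matmul(a, b)  # logs something like "Executing op MatMul in device /device:GPU:0"
    print(c.shape)

Any op that logs a CPU device in the middle of an otherwise GPU-resident model is exactly the kind of bottleneck described above.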


ROCm, or more specifically HCC, is comparable to CUDA: AMD needs to develop it more before anything else. It's the fundamental structure for GPU coding.

AMD can't even write a comparable library unless it is built on top of ROCm / HCC. Like, how would you even start to implement Tensorflow 2.0.0 for AMD Graphics cards otherwise?

OpenCL was theoretically an option. But the old AMDGPU-Pro driver had the OpenCL compiler as part of the driver stack, which meant OpenCL compiler issues would appear and disappear as you updated your graphics driver. Uhhhh... not good for deployment, to say the least. Lots of OpenCL code ends up with comments like "this causes the compiler to enter an infinite loop on AMD driver 12.5.whatever".

There are serious issues with OpenCL's fundamental design. You really shouldn't have a full compiler hidden inside your device drivers. It's a nightmare to test and deploy against, especially in a world of constantly changing drivers.
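
To make that concrete: OpenCL kernels are usually shipped as source strings and compiled at runtime by whatever compiler the installed driver happens to carry, so a driver update can silently change how (or whether) your kernel builds. A rough pyopencl sketch, assuming pyopencl and a working OpenCL runtime are installed:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # The kernel source is compiled here, at runtime, by whatever compiler
    # the installed driver happens to ship.
    prg = cl.Program(ctx, """
    __kernel void scale(__global float *x, const float factor) {
        int i = get_global_id(0);
        x[i] = x[i] * factor;
    }
    """).build()

    x = np.arange(16, dtype=np.float32)
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=x)
    prg.scale(queue, x.shape, None, buf, np.float32(2.0))
    cl.enqueue_copy(queue, x, buf)
    print(x)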

--------

ROCm is a proper compiler stack and a proper development environment. If anything, AMD's mistake was taking so long to make ROCm usable. Even today there are some issues (it doesn't work with Blender 2.79's OpenCL code yet), but at least it's a proper development environment that you can build libraries on top of.

AMD needs to get its compute stack in order if it wants to be taken seriously. Fortunately, it seems like they understand the issue. Hopefully the next year or two will be helpful, since they're ramping up on ROCm development.


AMD is a far smaller company than either nVidia or Intel, its two main competitors.

There are a variety of reasons for this - anticompetitive behavior from those competitors, chip designs like Bulldozer that bet on many weaker cores at the expense of single-threaded performance, etc. - but what it comes down to is that they just haven't dedicated resources to ROCm comparable to what nVidia has invested in CUDA.

I'm quite impressed with their current and near future designs in the CPU space, but GPU design is hard and beginning to hit a performance wall - witness the lackluster reaction to nVidia's current RTX chips in many areas.


I think the lackluster reception NVIDIA has gotten for the current RTX designs is only in small part caused by the design itself. Yes, real-time ray tracing, the new feature in these cards, is so bad that it's barely usable — but we're used to that, v2 will be better. I have a lot of sympathy for NVIDIA engineers trying to improve the state of the art, it's never easy.

However, the issue with NVIDIA is that they have adopted the antics of the game development space they cooperate with, and they have a poor attitude towards their customers as a result. They go so far as to outright say that their customers are stuck, so the pricing doesn't matter; you'll just pay for it and get on with your life anyway. One major example of this is the banning of their 'gaming' cards from data centres via an EULA, because they couldn't sell their 10x-priced 'enterprise' cards when the two performed within 10% of each other.

Another example is RTX, again. They've stopped producing their 10xx series cards to sell more RTX cards at a higher margin, because, boy, it turns out people still prefer the older cards to the RTX ones, given the price bloat NVIDIA has saddled RTX with.


> Yes, real-time ray tracing, the new feature in these cards, is so bad that it's barely usable — but we're used to that, v2 will be better. I have a lot of sympathy for NVIDIA engineers trying to improve the state of the art, it's never easy.

Can you explain this more? All the tests of RTX in actual games have been very positive and showed significant improvements in visual quality.


The results I've seen show the $1200 GPU playing at 1080p resolution with less than 60FPS.

RTX is clearly a compute hog: it's barely usable even if you pay for the best-of-the-best GPUs. I mean, not to knock Nvidia down or anything; raytracing is one of the most computationally demanding problems out there right now.

But from a practical, pragmatic perspective, you suffer a major loss in frame rate and have to drop the resolution to make the jump to raytracing. Even if you spent $1200 on the card...

-----------

I'm personally excited to see offline renderers use the RTX features to accelerate offline raytracing. That's probably the more important use of the technology. As it is, RTX isn't quite fast enough for "real time" yet -- just greatly accelerated offline raytracing (which is still impressive).


According to PC Gamer and Digital Foundry, the 2080 Ti can drive Metro Exodus with RTX on at an average of 55fps, with drops down to 30fps, at 4K resolution, so your claim seems grossly exaggerated - https://cdn.mos.cms.futurecdn.net/YhDHpgGrAmpmnP4LUBEgvg.png

Also last I read, RTX really isn't designed for offline raytracing and doesn't really bring much to the table. Its use is in realtime.


https://www.hardocp.com/article/2019/01/07/battlefield_v_nvi...

Battlefield V Ultra 1080p RTX on is ~63 FPS average (not minimum FPS, but average), which means it will regularly dip below 60 in practice.

> Also last I read, RTX really isn't designed for offline raytracing and doesn't really bring much to the table. Its use is in realtime.

On the contrary! RTX cores do hardware-accelerated BVH traversal. That has HUGE implications for the offline rendering scene.

See NVidia Optix for details: https://devblogs.nvidia.com/nvidia-optix-ray-tracing-powered...

IMO, this is the killer feature of RTX: accelerating those Hollywood renders from hours per frame to minutes per frame. NVidia OptiX takes industry-standard scene trees and can use the RTX cores to traverse them for hardware-accelerated raytracing.

Or more specifically: coarse AABB bounds checking, which is a very compute- and memory-heavy portion of the raytracing algorithm.
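
For anyone curious what that coarse check actually is, here's the classic slab test for one ray against one AABB, written out in plain Python. This is only an illustration of the math, not how the RT cores implement it:

    def ray_hits_aabb(origin, inv_dir, box_min, box_max):
        # Slab test: does a ray hit an axis-aligned bounding box?
        t_near, t_far = -float("inf"), float("inf")
        for o, inv_d, lo, hi in zip(origin, inv_dir, box_min, box_max):
            t1, t2 = (lo - o) * inv_d, (hi - o) * inv_d
            t_near = max(t_near, min(t1, t2))
            t_far = min(t_far, max(t1, t2))
        return t_near <= t_far and t_far >= 0.0

    # Ray from the origin along +x, box straddling x in [1, 2]
    print(ray_hits_aabb((0, 0, 0), (1.0, 1e30, 1e30), (1, -1, -1), (2, 1, 1)))  # True

A BVH traversal runs millions of these per frame before any triangle is ever touched, which is why putting it in fixed-function hardware matters.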

Even if RTX is too slow for video games, any improvement to offline rendering is a huge advantage.


OptiX was developed a decade ago and doesn't require RTX cards. It runs on any Maxwell or newer card. It's NVIDIA's version of RadeonRays.

What NVIDIA is trying to sell people is a denoiser.

And are you sure RT cores even exist, or are they just tensor cores by a different name?

https://youtu.be/3BOUAkJxJac


In the RTX announcement speech, they repeatedly mentioned the savings for offline rendering (taking fewer computers less time and fewer watts to render the same quality). I came away from watching that thinking that the card was primarily positioned for offline rendering.


My impression from the analyses I've seen so far is that it barely improves the visual quality, since the game designers learned to fake it so well over the last two decades, and it comes at a significant cost to frame rate. In particular, good ray-tracing with acceptable noise levels comes at around 500 rays a point. RTX cards are capable of 2 rays a point.

Here is a comparison of regular global illumination vs RTX real-time raytracing; in particular, take a look at the noise levels generated by RTX in the background: https://www.youtube.com/watch?v=CuoER1DwYLY
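
The gap is mostly a sample-count problem: Monte Carlo noise falls off roughly with the square root of the number of rays, so 2 rays per point versus ~500 is on the order of 16x more noise. Back-of-the-envelope, using the numbers above:

    import math

    # Monte Carlo error falls off roughly as 1/sqrt(samples), all else equal.
    def relative_noise(samples):
        return 1.0 / math.sqrt(samples)

    for spp in (2, 64, 500):
        print(f"{spp:4d} rays/point -> relative noise ~{relative_noise(spp):.3f}")

    # 500 rays vs 2 rays: about sqrt(500 / 2) ~= 15.8x less noise
    print(math.sqrt(500 / 2))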


> In particular, good ray-tracing with acceptable noise levels comes at around 500 rays a point. RTX cards are capable of 2 rays a point.

NVidia has that temporal denoising algorithm, though, so noise is smoothed across both space and time. When you have 60 frames per second, you can "recycle" the previous frames' raytracing data for the current frame.

As long as the raytraced lights rarely change, it works out pretty well. Fast-moving lights would probably still have to use rasterization techniques, but NVidia's demos of the temporal denoising algorithm are very impressive.
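
The core of the trick is an exponentially weighted blend of the current noisy frame into an accumulated history. Here's a toy numpy sketch of just that accumulation, with made-up numbers; the real denoiser also does reprojection and spatial filtering:

    import numpy as np

    def temporal_accumulate(history, current, alpha=0.1):
        # Blend the new noisy frame into the running history; lower alpha means
        # smoother output but more ghosting when the scene changes.
        return (1.0 - alpha) * history + alpha * current

    # Fake a static scene: a constant "true" image plus fresh per-frame noise.
    truth = np.full((4, 4), 0.5)
    history = truth + np.random.normal(0.0, 0.2, truth.shape)
    for _ in range(60):  # one second of frames at 60 fps
        noisy_frame = truth + np.random.normal(0.0, 0.2, truth.shape)
        history = temporal_accumulate(history, noisy_frame)

    print(np.abs(history - truth).mean())  # far below the 0.2 per-frame noise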

FYI: Blender Cycles' denoising is still kinda bad even at 1000 or even 5000 samples (at least when you look at low-light areas close to walls). You gotta zoom in and look, but artistic CGI requires many, many samples before raytracing looks smooth. It's a major advancement for NVidia to come up with a raytracing denoising algorithm that works well for video games on only 1-ish samples per pixel.


Theoretically AMD has done at least some of the work via ROCm. I have not tried using this, so I don't know how well tested it is.

https://rocm.github.io/tensorflow.html
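
If anyone wants to poke at it, AMD publishes a tensorflow-rocm wheel on PyPI, and you can sanity-check whether the GPU is actually visible before training anything. A quick check, assuming the ROCm stack itself installed cleanly:

    # pip install tensorflow-rocm
    import tensorflow as tf

    # Should list the AMD GPU if the ROCm build and kernel driver are working.
    gpus = tf.config.experimental.list_physical_devices('GPU')
    print("GPUs visible to TensorFlow:", gpus)

    if gpus:
        with tf.device('/GPU:0'):
            x = tf.random.uniform((2048, 2048))
            print(tf.reduce_sum(tf.matmul(x, x)))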


I think this is primarily due to the immense effort that NVIDIA has put into CUDA. It works very well and it is extremely fast. The alternatives for AMD are OpenCL and ROCm, which have seriously lagged behind CUDA in every respect.

EDIT: lots of theories and discussion here: https://www.reddit.com/r/MachineLearning/comments/7tf4da/d_w...


CUDA is one hell of a drug.

That said, I'm still surprised Apple gave up on NVIDIA support, especially since there's no it-just-works option for deep learning on a Mac.


Apple and NVIDIA have been in each other's bad books since the 2012 Macbook Pro incident, where the discrete GPUs in those devices ran hotter than their thermal specs and melted themselves off their solder joints, which NVIDIA refused to compensate Apple for. I know, because I had one of those. Apple took the hit and repaired those Macbook Pros even out of warranty, even though it was ultimately NVIDIA's poor engineering. They've not worked with each other since.

2012 was pretty much the apex of Apple's quality.


Apple and NVIDIA have spent the last six months fighting over whose fault it is that NVIDIA doesn't have working drivers for the latest OSX.

https://create.pro/blog/nvidia-drivers-for-macos-mojave-time...


At least AMD is working hard to upstream their drivers into the Linux kernel, which I don't doubt Apple appreciates when they merge the code into theirs.


Keeping in mind that Apple is primarily a consumer-focused company:

What's the it-just-works consumer machine learning use case that requires GPGPU on end user hardware?


Apple has been pushing machine learning at virtually every WWDC, and developer integrations of it have shown up in product launches ever since.

A developer can do the ugly hacking to get an NVIDIA eGPU working on macOS, but it's a hassle.


I'm not an expert on this and would love to hear from one, but as far as I know there's simply more work to be done on AMD's end, to provide fast implementations of all the operations TensorFlow, PyTorch, and others need. ROCm hasn't reached parity with CUDA.

Theano used to work with OpenCL and therefore AMD, but the support was never that great, and I don't know that many people used it. And Theano is now EOL.


NVIDIA invested heavily in drivers and libraries (CUDA and cuDNN) and other tools like TensorRT. They created an ecosystem where deep learning can be tuned and optimized to work well.

AMD took the minimal-effort route. They just design graphics cards with some added functionality, and if someone wants to build libraries for them, good.


It's just not worth trying to shoehorn a Mac into a deep learning role, and I say this as someone who works almost exclusively on OS X. If you don't want to build a dedicated deep learning box, it seems like there are some decent cloud options now (though I haven't tried them).



If you're interested in upgrading your models, also make sure to check out this Medium post: https://medium.com/tensorflow/upgrading-your-code-to-tensorf...


Thank god TensorFlow is finally doing something about global variables and sessions.
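
For contrast, here's roughly what changes. The TF 1.x half is left as comments; in TF 2.0 eager execution is the default and graphs become opt-in via tf.function:

    import tensorflow as tf

    # TF 1.x style (via the compat shim), shown only for contrast:
    #
    #   a = tf.compat.v1.placeholder(tf.float32)
    #   b = a * 2.0
    #   with tf.compat.v1.Session() as sess:
    #       print(sess.run(b, feed_dict={a: 3.0}))

    # TF 2.x style: eager by default -- no placeholders, sessions, or global graph.
    a = tf.constant(3.0)
    print(a * 2.0)

    # Graphs still exist, but you opt in per function instead of globally.
    @tf.function
    def double(x):
        return x * 2.0

    print(double(tf.constant(3.0)))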



What are you using in your projects?

Is TF the dominant tool in commercial or startup DNN projects?


TF (or Keras via TF) is definitely the most popular framework for newer projects, although some argue PyTorch is holistically better.


I think PyTorch is more popular in research, but TF is the most popular for commercial applications right now.


I have to use TF at work (small startup), but I use Pytorch for personal projects and research.


I'm about to use Flux.jl for my next project.

I have previously tried using TF but it was super painful so I used Keras instead.


Any thoughts on how this version now compares to PyTorch 1.0?


Cool. I would love to see PyPy support for this major version.



