DeepLearning11: 10x Nvidia GTX 1080 Ti Single Root Deep Learning Server (servethehome.com)
83 points by tim_sw on Oct 29, 2017 | 30 comments


It's nice to see NVidia's behavior here is more widely known now. To this day, I do not understand the justification for asserting that one cannot install GeForce GPUs wherever one wants to install them. I really wish that Nvidia felt more secure that killer Tesla features like NVLINK and the Tensor Cores were sufficient product differentiation to make purchasing Tesla GPUs worth the price.

And that's because to the best of my knowledge, most of the CUDA ecosystem out there was developed on GeForce GPUs.

There is currently no GeForce equivalent of Volta at a time when the underlying programming model has undergone some traumatic changes that really alter the way to write efficient code going forward. If the only way to access Volta GPUs turns out to be AWS instances at $25/hour or $150,000 DGX-1V servers plus hosting costs, I suspect a lot of existing CUDA code will bitrot.
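To make the "traumatic changes" concrete: Volta's independent thread scheduling breaks the old implicit warp-synchronous idioms, and CUDA 9 replaces the warp intrinsics with _sync variants that take an explicit participation mask. A minimal before/after sketch (function names are mine, purely illustrative):

    // Pre-Volta idiom: implicit warp-synchronous shuffle reduction.
    // Deprecated in CUDA 9, unsafe under Volta's independent thread scheduling.
    __device__ float warp_sum_legacy(float v) {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down(v, offset);
        return v;
    }

    // CUDA 9 / Volta idiom: every warp intrinsic carries an explicit mask.
    __device__ float warp_sum_volta(float v) {
        const unsigned full_mask = 0xffffffffu;  // assumes all 32 lanes participate
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(full_mask, v, offset);
        return v;
    }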

Imagine a near future where AMD Vega GPUs are faster than GTX 1080 TI at FP16 training and inference for deep learning. Without some sort of successor to that GPU, I really think that could happen because Nvidia went out of its way to cripple FP16 performance on GeForce.


The 1080 Ti doesn't have double-speed FP16; Vega 64 should be more than twice as fast for half-precision training/inference. The problem is that AMD has been lacking in framework support and fast optimised kernels. This seemed promising - https://news.ycombinator.com/item?id=15516166


It's worse than that, the FP16 MAD in GTX 1080 TI is significantly slower than converting 2 FP16 numbers to FP32, performing FP32 math on them, and accumulating results in FP32. Had it been the same speed, I don't think I would be anywhere near as annoyed as I am with this crippling.
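To put a sketch behind that: on GP102 (the 1080 Ti) native FP16 math reportedly runs at roughly 1/64 of the FP32 rate, so the convert-and-accumulate-in-FP32 path is the one you actually want there. Two illustrative kernels (mine, not from the article) showing the paths being compared:

    #include <cuda_fp16.h>

    // Path 1: native FP16 fused multiply-add. Full rate on GP100/V100,
    // heavily throttled on GeForce Pascal (GP102/GP104).
    __global__ void fp16_mad(const __half *a, const __half *b, __half *acc, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) acc[i] = __hfma(a[i], b[i], acc[i]);
    }

    // Path 2: convert FP16 -> FP32, do the math and accumulation in FP32.
    // Runs at full FP32 rate on a 1080 Ti, i.e. much faster than path 1 there.
    __global__ void fp16_via_fp32(const __half *a, const __half *b, float *acc, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) acc[i] = fmaf(__half2float(a[i]), __half2float(b[i]), acc[i]);
    }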

That said, at Vega's 26 or so FP16 TFLOPS, it wouldn't be hard for NVIDIA to release a GeForce Volta with 30-40 tensor core TFLOPS, that both stomped on Vega and remained significantly inferior to V100. Given how hard it is to program Volta optimally, I'm surprised they haven't done so already, if only as a Titan XV Edition that can only be purchased from their website.


Yeah, it will be interesting to see if they can bring the price of GPUs with tensor cores down; from what I've read, they're too expensive to make for the consumer market right now.

BTW - do you happen to know which DL frameworks currently support mixed precision training with Volta tensor cores? Curious to see if AWS V100 instances can really do 120 TFLOPS as advertised. I think the latest versions of CUDA/cuDNN support Volta now?


If this article is to be believed, they are not happy about people doing this:

https://www.pcgamesn.com/nvidia-geforce-server

And while there is no GeForce equivalent to Volta today, that will not be true in the near future. At some point they will come out with a new GeForce line of cards. In the past, the GeForce generation arrived either before the Teslas or just slightly after. I also don't agree that there is nothing competing with Volta right now on the GeForce line. The 1080 Ti is not the same performance, but if you are willing to have multiple cards, two of them are just as good as or better than the V100.


I hope Volta-based gaming cards will happen sometime in 2018 and they don't gimp them too much. V100 kills it with mixed precision compute for deep learning because of the Tensor Cores. If you can get mixed precision training working, you can get 120 TFLOPS out of it. You'd need 10x 1080 Tis to reach that.
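For context on that 120 TFLOPS figure: it is only reachable through the Tensor Cores, which CUDA 9 exposes via the WMMA API as 16x16x16 matrix multiply-accumulate tiles with FP16 inputs and an FP32 accumulator. A minimal single-warp sketch of the API shape (sm_70 only; illustrative, not a tuned kernel):

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes C += A * B on a single 16x16x16 tile:
    // half-precision inputs, FP32 accumulation (the mixed-precision path).
    __global__ void wmma_tile(const half *A, const half *B, float *C) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
    }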


Has there been much progress towards making AMD cards work with popular deep learning libraries?

There's https://rocm.github.io/index.html but it's not quite clear to me how far they got and how usable it is today.


There is a vendor supported tensorflow for AMD devices: https://github.com/ROCmSoftwarePlatform/hiptensorflow


You start by talking about Nvidia’s behavior but nowhere in the article is there any mention of that, so you’re leaving us to guess what you’re talking about.


Perhaps he is talking about this?

> NVIDIA specifically requests that server OEMs not use their GTX cards in servers. Of course, this simply means resellers install the cards before delivering them to customers.


Price discrimination works and if NVidia didn't use it they'd probably jack up prices on GeForces instead.


As long as AMD exists, I doubt it.


Commodity GeForce cards versus enterprise Tesla cards is history repeating itself: enterprise disks vs commodity disks. We all know who won in the end (commodity disks, of course).

The 1080 Ti is about 60% of the performance of the P100, but costs 700 dollars instead of 5k. Of course people will try to build these boxes; they have a much higher ROI than Nvidia's DGX-1. So what does Nvidia do? Try to stop vendors from selling them! https://www.pcgamesn.com/nvidia-geforce-server


It's probably better to analogize it to Xeon, which prints Intel roughly as much cash as the tax revenues of Croatia.

There are plenty of things to arbitrarily segment in a GPU - HBM, large memory sizes, tensor cores, FP64.

The problem with hard disk segmentation wasn't the idea of segmenting at all; it's that there were no good ways of doing it. A higher MTBF? You're better off buying more, cheaper disks. Density (e.g. from helium)? Might be worth a 20-50% premium in $/GB, but not the 200-500% premium in $/transistor you see with Xeon.


There are some other knobs that NVIDIA will surely try (has surely tried?) to tweak as well, such as making the consumer devices deliberately less power efficient, fusing off chip features which make compute workloads more efficient, or sabotaging the drivers (either in general or on a per-application basis).

With disks there was always just as much pressure on the consumer side to keep energy down, cost down, and capacity up, which meant there was no natural segmentation, and no straightforward unnatural segmentation either.


I think you're vastly underestimating the amount of work that goes into those things. What if, instead of what you suggest, they simply didn't invest as hard in the gaming parts? Design, QA, QC and support cost money, after all.


Is the initial cost of purchase the largest factor? Or are running costs enough to tip the balance?


Read the article. Running costs can be quite high (up to $1,000/month with colocation/energy), but that is still massively cheaper than running the same GPUs 24x7 in the cloud.


I took the question to be about the choice between the two products.


Where did you get that the 1080 Ti is 60% of the performance of the P100? It is slightly higher performance than the P100, but rated with a lower MTBF. Of course, you end up with many crippled features, like RDMA and dynamic parallelism, which most people probably don't need anyway.


Deep learning training is basically 2 phases: the forward pass and the backwards pass (backpropagation). There are no loops. As such, most DL training is memory bound.

Memory bandwidth of 1080 Ti: 484.4 GB/s
Memory bandwidth of P100: 720.9 GB/s
Memory bandwidth of V100: 900.1 GB/s
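One way to see the bandwidth argument: the elementwise kernels that pad out a training step (bias, ReLU, batch-norm scale/shift, optimizer updates) do only a couple of FLOPs per ~8 bytes moved, so their runtime is set almost entirely by the numbers above. A hedged back-of-the-envelope sketch:

    // Bias + ReLU: ~8 bytes of traffic and ~2 FLOPs per element, i.e. ~0.25 FLOP/byte.
    // At 484 GB/s (1080 Ti) that caps this kernel at roughly 120 GFLOPS, far below the
    // card's ~11 TFLOPS FP32 peak: memory bandwidth, not FLOPS, sets the speed here.
    __global__ void bias_relu(float *x, const float *bias, int n, int c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i] + bias[i % c];
            x[i] = v > 0.0f ? v : 0.0f;
        }
    }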


For applications that are not memory bound, the 1080 Ti is better. Also, if we are talking about deep learning, inference can be handled with lower precision. The P100 does not have int8 support in hardware, but the 1080 Ti does. So presumably for the same memory bandwidth you get twice as much data.
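For reference, the int8 support being referred to is the DP4A instruction on sm_61 (GP102/GP104/GP106, i.e. the 1080 Ti, but not the P100's GP100): a four-way dot product of packed 8-bit values accumulated into a 32-bit integer, which is what INT8 inference GEMMs are built on. A minimal sketch (kernel name and layout are mine):

    // Each int packs four signed 8-bit values; __dp4a computes
    // acc += a0*b0 + a1*b1 + a2*b2 + a3*b3 in a single instruction (sm_61+).
    __global__ void int8_dot4(const int *a, const int *b, int *acc, int n4) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) acc[i] = __dp4a(a[i], b[i], acc[i]);
    }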


How do they keep those cards cool when they are packed so close together? I built a crypto currency mining rig once and I had to have a reasonable amount of separation between the cards so the side mounted fans had somewhere to pull air from.


Very interesting. I have just a single one of these (Gigabyte Aorus 1080 Ti), and have recovered a third of the purchase price (or a bit more, with the latest bitcoin spike) over 4 months of mining.

I'd like to, ah, learn about machine learning, so here's hoping Nvidia doesn't nerf its deep learning capabilities at the driver level.


I gave a talk at the Spark Summit Europe last week, where I went into detail on this server and how we can scale out deep learning training on TensorFlow with it and AllReduce (by Uber): https://www.slideshare.net/secret/A7b9rAsLaipg6

TL;DR: You can scale out distributed TensorFlow training to tens or hundreds of GPUs with AllReduce on machines like this one, not just on the DGX-1.


The file is marked private?!



Can you set your SlideShare to public? Thanks!


Fixed.


Why use a dual Xeon E5-2650 V4 instead of a single Epyc 7401P?

Are there no appropriate mainboards yet?



