You can copy that spreadsheet and insert whatever system price (in kilodollars) you want into B15:F15. Hope this makes everybody's decision making easier.
As a system builder and AI research company, we're trying to make benchmarks that are scientific, reproducible, correlate with real world training scenarios, and have accurate prices.
There was a good talk from NVIDIA at last year's GTC:
Here is another relevant blog post: https://devblogs.nvidia.com/mixed-precision-training-deep-ne...
EDIT: Also, not everything in the training loop is a matrix multiplication, which is where tensor cores are useful.
These benchmarks are for training, so the expectation is that they are running them in fp16 all the way through. Also, tensor cores can accumulate in fp32 registers with a slight hit to performance.
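For anyone who hasn't seen the recipe, here is a minimal sketch of what "fp16 all the way through" with FP32 accumulation looks like in practice. It uses PyTorch's torch.cuda.amp (a newer API than the one in the NVIDIA post linked above, but the same idea); the model, data, and hyperparameters are placeholders, not the benchmark code.

    import torch
    from torch import nn
    from torch.cuda.amp import autocast, GradScaler

    # Placeholder model and data, purely for illustration.
    model = nn.Linear(1024, 1024).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    scaler = GradScaler()  # dynamic loss scaling keeps small FP16 gradients from underflowing

    for _ in range(10):
        x = torch.randn(64, 1024, device="cuda")
        target = torch.randn(64, 1024, device="cuda")
        optimizer.zero_grad()
        with autocast():  # matmuls run in FP16 (tensor cores); master weights stay FP32
            loss = nn.functional.mse_loss(model(x), target)
        scaler.scale(loss).backward()   # backprop on the scaled loss
        scaler.step(optimizer)          # unscales gradients, then updates the FP32 weights
        scaler.update()                 # adjusts the loss scale for the next step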
With a carefully tuned Transformer (matmul-heavy!), I could only get it to run twice as fast as a 1080 Ti (at 4 times the price).
The only indisputable benefit was being able to use double the batch size.
If you can write a better-optimized network, go ahead. But as with SSE2 vs AVX2 vs AVX512 benchmarks, FP performance on paper doesn't always translate into better real-world FP performance. Now if Nvidia had switched to HBM2 like Google's TPUv2, it might be different.
This one is more oriented toward the lower end.
- 37% faster than the 1080 Ti with FP32, 62% faster with FP16, and 25% more expensive.
- 35% faster than the 2080 with FP32, 47% faster with FP16, and 25% more expensive.
- 96% as fast as the Titan V with FP32, 3% faster with FP16, and ~1/2 of the cost.
- 80% as fast as the Tesla V100 with FP32, 82% as fast with FP16, and ~1/5 of the cost.
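To turn those relative numbers into rough performance-per-dollar figures, here is a quick sketch; it is nothing but arithmetic on the list above, no new measurements.

    # 2080 Ti relative speed and price vs. each card, taken from the list above.
    # (fp32_speed, fp16_speed, relative_price), all normalized to the comparison card = 1.0.
    comparisons = {
        "1080 Ti":    (1.37, 1.62, 1.25),
        "2080":       (1.35, 1.47, 1.25),
        "Titan V":    (0.96, 1.03, 0.50),   # "~1/2 of the cost"
        "Tesla V100": (0.80, 0.82, 0.20),   # "~1/5 of the cost"
    }

    for card, (fp32, fp16, price) in comparisons.items():
        print(f"vs {card:10s}: FP32 perf/$ {fp32 / price:.2f}x, FP16 perf/$ {fp16 / price:.2f}x")

By that measure the 2080 Ti comes out ahead on perf/$ against every card in the list, most dramatically against the V100.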
No, AMD GPUs are not cost-effective at all, because TensorFlow does not support AMD GPUs.
> But people are locked in the Nvidia proprietary jail and no one seems to care...
Sounds like you want to blame the users, but this is because Nvidia has invested heavily in GPGPU and CUDA for more than 10 years, while AMD focused on other things like HSA. It is AMD's fault.
MIOpen is a step in this direction, but it still leaves a Vega 64 + MIOpen at 60% of the performance of a 1080 Ti + cuDNN, based on benchmarks we've conducted internally at Lambda. Let that soak in for a second: the Vega 64 (15 TFLOPS theoretical peak) is 0.6x of a 1080 Ti (11.3 TFLOPS theoretical peak). MIOpen is very far behind cuDNN (quick arithmetic on that below).
Lisa Su, if you're reading this, please give the ROCm team more budget!
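To show how much of that is a software gap rather than a hardware one, here is some back-of-the-envelope arithmetic using only the figures quoted above:

    # Figures quoted above: theoretical FP32 peaks and the measured end-to-end ratio.
    vega64_peak_tflops = 15.0
    gtx1080ti_peak_tflops = 11.3
    observed_ratio = 0.6  # Vega 64 + MIOpen relative to 1080 Ti + cuDNN

    theoretical_ratio = vega64_peak_tflops / gtx1080ti_peak_tflops  # ~1.33x in AMD's favor
    software_efficiency = observed_ratio / theoretical_ratio        # ~0.45
    print(f"MIOpen extracts ~{software_efficiency:.0%} as much of its theoretical peak as cuDNN does")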
If they did that and had a card that got 2x performance/$ or more, I would switch in a heartbeat.
The quality and open-source nature of their tools have resulted in much of my research group (real-time vision) increasingly and voluntarily moving to work on AMD platforms (we were previously almost exclusively using CUDA).
Also, AMD does not limit FP performance on consumer cards.
I don't understand this meme. The consumer cards are different chips with slow fp64 hardware. In what sense is that "limiting" performance relative to the enterprise cards?
"For their consumer cards, NVIDIA has severely limited FP16 CUDA performance. GTX 1080’s FP16 instruction rate is 1/128th its FP32 instruction rate, or after you factor in vec2 packing, the resulting theoretical performance (in FLOPs) is 1/64th the FP32 rate, or about 138 GFLOPs."
"FP16 performance is 1/64th and FP64 is 1/32th of FP32 performance."
If we're only counting model training, it runs on CPUs, Google's TPUs, FPGAs, whatever other secret datacenter ASICs are out there, various DL-specific mobile chips, etc.
Way more than 5% of the world's hardware can run inference with deep neural nets, which is the important thing for mass adoption, and definitely more than only Nvidia GPUs can run training.
Maybe in theory, but you can get a new 1080 Ti for $700 while a 2080 Ti is impossible to find under $1200, which makes it 70% more expensive. At that point, 2x 1080 Ti sounds way better than 1x 2080 Ti for deep learning to me (up to 22 TFLOPS and 22 GB of RAM).
1.75 (the effective speed of 2x 1080 Ti) / 1.36 (the speed-up of a 2080 Ti over a single 1080 Ti) = 1.28. So expect 2x 1080 Ti to be about 30% faster than a single 2080 Ti.
You can see how multi-GPU training scales with the Titan V benchmarks in the link below; 1080 Tis have a similar scaling profile.
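Spelling that arithmetic out in a couple of lines (both inputs are the numbers from the comment above):

    two_1080ti_scaling = 1.75   # effective speed of 2x 1080 Ti relative to one (not a perfect 2x)
    one_2080ti_speedup = 1.36   # speed of one 2080 Ti relative to one 1080 Ti
    advantage = two_1080ti_scaling / one_2080ti_speedup
    print(f"2x 1080 Ti vs 1x 2080 Ti: {advantage:.2f}x (~{(advantage - 1) * 100:.0f}% faster)")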
It's been a long time since GPU prices have been anywhere near MSRP; it seems dishonest to assume that they will be in the near future.
In the past, it has been common that new hardware with low initial supply fetched premiums that lasted until supply met demand. It's not dishonest to assume that it's the same case here.
While we could write software that pulls today's market prices and updates the tables and graphs dynamically, we decided to settle on a single number. If you are settling on a single number, the choice is either the market price on the date of publication or MSRP. Given that other GPUs tend toward their MSRP as time goes on, we chose MSRP.
By the way, the MSRP for a regular 2080 Ti is $999; $1,200 is the Founders Edition price.
GPU modules are manufactured in China. Their harmonized codes are covered in recently established tariffs. 10% tariffs are already hitting cards arriving at US ports. This tariff will increase to 25% on Jan 1.
Prices will stay well above MSRP.
1080 Tis are back under MSRP right now and have been for some time. We decided to assume that 2080 Tis would exhibit similar behavior.
I feel like that's becoming TensorFlow's 'native platform'...
TPUv2 was benchmarked against NVIDIA's K80 at 30 times the performance and had a peak of 92 TOPS, while the 2080 Ti is at 440 TOPS.
Nvidia is destroying the TPUs right now, and Google is desperate to keep up its public image as the king of AI (which, to be fair, they probably are, compute capabilities aside).
In this case the data was from the time Google introduced the TPU internally, when the K80 was very much up-to-date. It also makes sense because the K80 was the only GPU offered in GCP.
Also, there's no need for assumptions when you know what's going on in the design team.
(disclaimer: while I'm part-time at Google, this is my personal impression, not an official statement, etc., etc.)
(420 TFLOPS, 128 GB HBM)
The price/performance ratio of rented TPUv2 or V100 can't match the price/performance ratio of owning the system if you are doing lots of learning/inference.
If the model fits inside a 2080 Ti and the work is not tightly time-restricted, the 2080 Ti (the whole $2.5k system) should be the more economical choice after six months or less of full 24/7 utilization.
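A rough break-even sketch of that claim: the $2.5k system cost is from the comment above, while the per-hour rates are my assumptions for the effective cloud price of 2080 Ti-equivalent compute, so plug in your own numbers.

    system_cost = 2500.0        # USD for the whole 2080 Ti workstation
    hours_per_month = 24 * 30   # full 24/7 utilization

    def breakeven_months(cloud_rate_per_hour):
        """Months of 24/7 use after which owning beats renting at the given $/hour."""
        return system_cost / (cloud_rate_per_hour * hours_per_month)

    for rate in (0.60, 1.00, 2.50):  # assumed $/hr of equivalent cloud GPU/TPU compute
        print(f"${rate:.2f}/hr -> break-even after ~{breakeven_months(rate):.1f} months")

Anything above roughly $0.60/hr of equivalent compute puts the break-even under six months.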
DAWNBench does benchmarks.
"At the time DAWNBench contest closed on April 2018, the lowest training cost by non-TPU processors was $72.40 (for training ResNet-50 at 93% accuracy with ImageNet using spot instance). With Cloud TPU v2 pre-emptible pricing, you can finish the same training at $12.87. It's less than 1/5th of non-TPU cost. "
Here is a link to DAWNBench.
The RTX 2080 Ti, on the other hand, is like a Porsche 911. It's very fast, handles well, expensive but not ostentatious.
Only compared to a car which costs millions could it be considered reasonable and not ostentatious.
Guess I'll be sticking to my "poor man's" GTX 970 :p
I feel like he answered that question directly, by comparing it to an ostentatious computer. ;)
“And if you think I'm going overboard with the Porsche analogy, you can buy a DGX-1 8x V100 for $120,000 or a Lambda Blade 8x 2080 Ti for $28,000 and have enough left over for a real Porsche 911. Your pick.”
Kinda depends on which Lamborghini; an Urus probably does a better job of passing for normal in a grocery store parking lot (and is better for actually carrying groceries) than a 911.