
Caffe2 adds 16 bit floating point training support on the NVIDIA Volta platform - bwasti
https://caffe2.ai/blog/2017/05/10/caffe2-adds-FP16-training-support.html
======
maldeh
From first looks, there is little doubt that NVIDIA's Volta architecture
is a monster and will revolutionize the AI and HPC market. But the article
seems to avoid quantifying how 16-bit FP operations are beneficial against
32- or 64-bit FP operations in real-world use cases, or how the Caffe2 / NVIDIA
architecture provides any significant boost to FP16 in particular, especially
apropos of images (or why FP16 is better for image data in general).
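
For context on the precision question, here's a quick numpy sketch (my own illustration, not from the article) of why naive FP16 training is tricky and why mixed-precision schemes keep an FP32 "master" copy of the weights:

```python
import numpy as np

# Illustrative numbers only: a small gradient update can vanish when
# added directly to an FP16 weight, because FP16 spacing near 1.0 is
# about 9.8e-4.
weight_fp16 = np.float16(1.0)
grad = np.float16(1e-4)          # a typical small gradient step

# In FP16, 1.0 + 1e-4 rounds back to 1.0: the update is lost.
updated_fp16 = np.float16(weight_fp16 + grad)

# With an FP32 master weight, the same update survives.
weight_fp32 = np.float32(1.0)
updated_fp32 = weight_fp32 + np.float32(grad)

print(updated_fp16 == np.float16(1.0))  # True: update lost
print(updated_fp32 > 1.0)               # True: update retained
```

This is exactly the kind of trade-off I'd want the article to quantify for real workloads.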

I'm interested more in understanding why Caffe2 would outperform Theano,
Tensorflow, MXNet, etc. once Volta chipsets are generally available, beyond
early pre-release optimization, particularly when most of the front-runners
are already leveraging / taking into account NCCL, CuDNN, NVLink, etc. When
the burden of adding support for new NVIDIA primitives is so low, what gives
Caffe2 an advantage beyond an ephemeral "we were partners with NVIDIA first"
one-up that would last for a couple of months at most?

(Apologies in advance if this post sounds overly negative, but I am constantly
evaluating the current crop of frameworks for the trade-offs they enforce on
the problem space, and a definitive answer would be very helpful.)

~~~
deepnotderp
Caffe2 seems to be engineered for efficiency as opposed to flexibility. For
example, Caffe2 is apparently Facebook's go-to choice for mobile edge device
deployment for applications such as style transfer.

~~~
throwaway87213
Efficient how? In terms of memory management?

~~~
eleitl
TPUs are smaller in terms of Si real estate and burn less power. Also, they
can be made much faster.

------
DocSavage
The AnandTech article on the Volta has a lot more information on the new
architecture: [http://www.anandtech.com/show/11367/nvidia-volta-unveiled-
gv...](http://www.anandtech.com/show/11367/nvidia-volta-unveiled-gv100-gpu-
and-tesla-v100-accelerator-announced)

It's interesting that the speed-up isn't more pronounced between Volta and
Pascal, considering the Tensor cores on paper give you about 6x the TFLOPS.
The price differential looks large.

From AnandTech: "By the numbers, Tesla V100 is slated to provide 15 TFLOPS of
FP32 performance, 30 TFLOPS FP16, 7.5 TFLOPS FP64, and a whopping 120 TFLOPS
of dedicated Tensor operations. With a peak clockspeed of 1455MHz, this marks
a 42% increase in theoretical FLOPS for the CUDA cores at all sizes. Whereas
coming from Pascal, for Tensor operations the gains will be closer to 6-12x,
depending on the operation precision."

------
mappu
Are numbers available for FP16 on P100 or FP32 on V100? It would make for a
more direct comparison.

EDIT: Nvidia's advertised TFLOPS are:

    
             FP16  FP32  FP64
        V100 30    15    7.5
        P100 21.2  10.6  5.3
        K40  4.29  4.29  1.43
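
As a sanity check (using commonly cited CUDA core counts and boost clocks, not numbers from this thread), the FP32 figures roughly follow from cores × clock × 2 FLOPs per cycle:

```python
# Back-of-envelope peak FP32 throughput: one fused multiply-add
# (2 FLOPs) per CUDA core per cycle.
def peak_tflops(cores, clock_ghz, flops_per_cycle=2):
    return cores * clock_ghz * flops_per_cycle / 1000.0

v100_fp32 = peak_tflops(5120, 1.455)   # ~14.9, matching the "15 TFLOPS" figure
p100_fp32 = peak_tflops(3584, 1.480)   # ~10.6

print(round(v100_fp32, 1), round(p100_fp32, 1))
```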

~~~
Tom1971
Your table doesn't include the new "tensor core" TFLOPS of V100.

That's a core that does 4x4 FP16 matrix multiplication + 4x4 FP32 accumulation
in one go.

That's where V100 gets its boost, up to 120 TFLOPS.
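
In numpy terms (an illustrative sketch of the arithmetic pattern, not actual tensor-core code), that looks like:

```python
import numpy as np

# Tensor-core-style pattern: FP16 4x4 inputs, products accumulated in
# FP32. The hardware does this fused; here we just mimic the dtypes.
a = np.full((4, 4), 0.1, dtype=np.float16)
b = np.full((4, 4), 0.1, dtype=np.float16)

# FP16 inputs, FP32 accumulation (promote before the multiply-accumulate):
d_fp32_acc = a.astype(np.float32) @ b.astype(np.float32)

# Naive all-FP16 accumulation for comparison:
d_fp16 = a @ b  # numpy keeps float16 throughout

print(d_fp32_acc.dtype, d_fp16.dtype)
```

The FP32 accumulate is what keeps long dot products from losing precision even though the inputs are half-precision.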

------
jakebasile
The GPU looks like a monster, and I am sure it'll deliver more power to AI
applications, but what I really want to know is when I can put one in my
gaming PC.

I think it's great that what started as a specialist gaming device is now
being used in industry for Big Things. The development cost that Nvidia et al.
have invested in new designs has undoubtedly been financed in (large) part by
the gaming community. Now income and advancements for both sectors feed into
the other and gamers like me are reaping the benefits with reduced
price:performance across the range.

~~~
pjmlp
Meanwhile I still remember sitting in the Game Developers Conference talk in
2009 where Intel tried to convince us how Larrabee would change the world of
computing.

Still waiting for them to produce anything worth buying instead of AMD and
NVidia GPUs.

~~~
dr_zoidberg
Actually, it's now called Xeon Phi:
[https://en.wikipedia.org/wiki/Xeon_Phi](https://en.wikipedia.org/wiki/Xeon_Phi)

~~~
pjmlp
I know; they are pretty hard to get and as such largely ignored by the HPC and
gaming communities.

~~~
slizard
Hard to get? For better or worse TTBOMK there are more Intel Xeon Phi machines
on the TOP500 than GPU accelerated systems.

~~~
pjmlp
Yes, hard to get for people with normal bank accounts buying from a random
computer store.

------
iamNumber4
I'm a long time Nvidia user. However, the recent article headlines have been a
word salad.

The new tesla volta super flip flop at 1.21 gigawatts blah blah blah.

Just saying.

Also, kudos to Nvidia for the buzzword / made-up-word creation for their
products.

~~~
Symmetry
Surely getting people to refer to their vector lanes as "cores" was their
biggest piece of marketing magic. Yes, they're more flexible than SIMD lanes,
but not more so than an OoO core's execution ports.

------
throwaway87213
I can see why they launched 1080 Ti early. Had I not seen this I'd definitely
not be waiting for Volta.

How are things in the red camp? There was some HIP thing where Fiji was as
good as Pascal.

~~~
eleitl
Any idea on the timeline for consumer Volta?

~~~
dogma1138
Early 2018, depending on GDDR6 supply. It seems like Hynix will start mass
production in late 2017.

------
BugsJustFindMe
I understand the pedigree of these cards, but at what point do we stop calling
them GPUs and start calling them something else? MPU? TPU? I don't know, but
isn't it a little bit weird to keep using the word "graphics" for something
that is being made more and more specifically for other things?

~~~
joelthelion
I don't see the point of adding to the confusion. They've been called GPUs
forever, everybody agrees on the term, so why not continue to use it?

~~~
DaiPlusPlus
I like the term "GPGPU" (General-Purpose Graphics Processing Unit) as it's at
least explicit about its general-usefulness.

~~~
intoverflow2
It's the exact opposite of general purpose, though; it's several specific
purposes.

~~~
kajecounterhack
No, it's several specific purposes that heavily use the provided _general_
primitives. There's nothing stopping more things from using it, which makes it
general.

You don't call your PC a specific-purpose computer just because all you do is
use Google Chrome.

------
TekMol
Can the Volta architecture be used to run WebGL in a Browser?

------
Aliyekta
cudnn 7?

