Volta: Advanced Data Center GPU (nvidia.com)
278 points by abhshkdz on May 10, 2017 | 158 comments

These tensor cores sound exotic: "Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 multiply and FP32 accumulate) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock. This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations, resulting in a total 12X increase in throughput for the Volta V100 GPU compared to the Pascal P100 GPU. Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply." Curious to see how the ML groups and others take to this. Certainly ML and other GPGPU usage has helped Nvidia climb in value. I wonder if Nvidia saw the writing on the wall, so to speak, when Google released its specialty hardware called the Tensor Processing Unit, and so decided to use "Tensor" in its own branding as well.
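For anyone who wants to see what the quoted numbers mean, here is a rough sketch in plain Python. It emulates one 4x4x4 multiply-accumulate with FP16 inputs and FP32 accumulation, and redoes the 1024 ops/clock arithmetic. The function names are made up for illustration, and the rounding is emulated with the `struct` module; this says nothing about how the hardware is actually wired.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def mma_4x4x4(a, b, c):
    """Emulate D = A*B + C on 4x4 matrices: FP16 inputs, FP32 accumulator."""
    n = 4
    d = []
    for i in range(n):
        row = []
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # Inputs are rounded to FP16; their product is kept unrounded
                # here and only the running sum is rounded to FP32.
                product = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = to_fp32(acc + product)
            row.append(acc)
        d.append(row)
    return d

# Throughput arithmetic from the quoted text.
FMAS_PER_TENSOR_CORE = 4 * 4 * 4   # 64 FMAs for one 4x4x4 multiply
FLOPS_PER_FMA = 2                  # one multiply plus one add
TENSOR_CORES_PER_SM = 8
flops_per_clock_per_sm = FMAS_PER_TENSOR_CORE * FLOPS_PER_FMA * TENSOR_CORES_PER_SM

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
sample = [[0.5 * (4 * i + j) for j in range(4)] for i in range(4)]
result = mma_4x4x4(identity, sample, zeros)
```

Multiplying by the identity just returns `sample`, which is a quick sanity check that the emulation is wired up correctly.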

"Tensor hardware" is a very vague term that's more marketing than an actual hardware type. I guarantee you that these are really SIMD or matrix units, like the Google TPU, that they just decided to call "Tensor" because, you know, it sells.

They're matrix units just like in the Google TPU but the TPU stands for "Tensor Processing Unit" so that's consistent. There's no reason to add special SIMD units when the entire core is already running in SIMT mode and by establishing a dataflow for NxNxN matrix multiplies you can reduce your register read bandwidth by a factor of N. Which isn't as huge for NVidia's N=4 as for Google's N=256 but is still a big deal, and diminishing returns might mean that NVidia is getting most of the possible benefit when stopping at 4 and preserving more flexibility for other workloads.
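A back-of-the-envelope version of the "factor of N" claim, under a deliberately simplified model: without a matrix dataflow every FMA fetches both of its operands from registers, while with one each element of A and B is fetched once and reused inside the unit. Accumulator traffic is ignored, and the function name is hypothetical.

```python
def operand_reads(n):
    """Count register reads for one n x n x n matrix multiply.

    Returns (reads without reuse, reads with reuse, reduction factor).
    """
    fmas = n ** 3
    without_reuse = 2 * fmas      # each FMA reads one A and one B element
    with_reuse = 2 * n * n        # each element of A and B is read once
    return without_reuse, with_reuse, without_reuse // with_reuse

nvidia_style = operand_reads(4)     # N = 4, as in the Volta tensor core
google_style = operand_reads(256)   # N = 256, as cited for the TPU
```

In this model the reduction factor is exactly N, so 4 for the tensor core and 256 for the TPU.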

As a layman reading the matrix multiply material, that's what it sounded like to me as well, given my understanding of SIMD and such. Especially when they mentioned BLAS. But I am no expert.

Yup, the TPU too: it was just a systolic matrix multiplier, but hey, it's Google, and they called it a "Tensor processor", so let's get a hard-on.

Google's hardware is for inference, not training.

Volta is for both inferencing and training, but has an emphasis on inferencing.

thanks for clarifying.

It doesn't matter, operations are the same in forward and backward mode.

"Made for inference" just means "too slow for training" if you are pessimistic or "optimized for power efficiency" if you are optimistic.

Otherwise training and inference are basically the same

You can do inference pretty easily with 8-bit fixed point weights. Now attempt doing the same during training.

Training and inference are only similar at a high level, not in actual application.

... because the gradient that is being followed may have a lower magnitude than can be represented in the lower precision.

You also need a few other operations for training, such as transpose, which may or may not be fast in a particular implementation.

(ETA: In case it's not obvious, I'm agreeing with david-gpu's comment, and adding more reasons that training currently differs from inference.)
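The precision point above (an update smaller than what the format can represent) is easy to reproduce. A minimal sketch, assuming a weight of 1.0 and a hypothetical update of 1e-4, with FP16 and FP32 rounding emulated through Python's `struct` module:

```python
import struct

def round_fp16(x):
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_fp32(x):
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

weight = 1.0
update = 1e-4  # hypothetical learning_rate * gradient, chosen for illustration

# FP16 values near 1.0 are spaced 2**-10 (about 0.00098) apart, so an
# update of 1e-4 is less than half a step and is rounded away.
fp16_result = round_fp16(round_fp16(weight) + round_fp16(update))

# FP32 values near 1.0 are spaced 2**-23 apart, so the update survives.
fp32_result = round_fp32(round_fp32(weight) + round_fp32(update))

update_lost_in_fp16 = fp16_result == weight
update_kept_in_fp32 = fp32_result > weight
```

Inference never performs this kind of small increment on the weights, which is one reason it tolerates low precision better than training does.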

It's really cool how much performance you can get out of hardware dataflows.

More great hardware being stuck behind proprietary CUDA when OpenCL is the thing they should be helping with. Once again proprietary lock in that will result in inflexibility and digital blow-back in the long run. Yes I understand OpenCL has some issues and CUDA tends to be a bit easier and less buggy, but that doesn't detract from the principles of my statement.

I am the author of DCompute, a compiler/library/runtime framework for abstracting OpenCL/CUDA for D. You can write kernels already, although the API automation is still a work in progress. I'm hoping that this should level the field a bit, because let's face it, people use CUDA for two reasons: the OpenCL driver API sucks; and the utility libraries (cuDNN et al) for CUDA. Possibly driver quality as well.

By having an API that's not horrible to use, that advantage is gone. The utility libraries will be more of a challenge to undermine, but since it targets CUDA natively there is no disadvantage to users of Nvidia's hardware, though there is no advantage to others yet (see GLAS[1] for what is possible with relative ease). Using D as the kernel language will also bring significant advantages over C/C++: static reflection, sane templates and compile-time code generation, to name a few.

You can find it at https://github.com/libmir/dcompute.

If you have any question, please ask!

[1] https://github.com/libmir/mir-glas

^ This!

Please read this before moving on: https://twitter.com/jrprice89/status/667466444355993600

Also, NVIDIA's CUDA compilers are built on clang, which does have an OpenCL frontend, so all they would need to do is put some resources into making that frontend work with their current nvcc toolchain.

Many request and want this, but instead they are trying hard to hold back OpenCL, just because providing OpenCL 2.0 support (and extensions for their GPUs' features) may help adoption of OpenCL, which in turn may end up helping other folks and companies too.

Nobody else is even bothering to compete, so standards don't really matter. Let them do their job: I'd rather have faster GPUs.

Standards matter if you care about software and hardware freedom.

You really don't have freedom if there are no legitimate competitors.

But accepting a monopoly standard guarantees loss of freedom and rules out any future prospect of legitimate competition.

Successful open systems tend to have standards that "just happened", not prescriptive standards. Like x86 PCs or Linux or most programming languages.

CUDA seems to be clearly winning over OpenCL in the real world so other vendors should just adopt it. AMD already has a CUDA compiler IIRC.

At the point where you suggested that x86 PCs are "open systems" (listing it next to "Linux" of all things!), I realized that you don't get it. We are where we are, with Intel ripping off consumers and companies alike, exactly because nobody realized that x86 is anything but open.

A similar mistake is about to happen, but luckily on the software side where losses can be cut quick and mistakes can be reversed easier -- though many will suffer when they have to reimplement their precious library from ground up because they did (or could) not take into account the fact that CUDA is as proprietary as it gets.

AMD has no CUDA compiler BTW. And CUDA is not a programming language FYI. ;)

I'm pretty sure I referred to open systems and CUDA in their commonly understood meanings. Here are some links that may help to clarify the concepts:



Aside: I have no position on whether CUDA's Fortran and C++ dialects constitute their own languages, nor did I refer to CUDA as a programming language.

> http://www.pcmag.com/encyclopedia/term/48478/open-system

Sadly, that's a very problematic, borderline BS definition.

"A system that allows third parties to make products that plug into or interoperate with it. For example, the PC is an open system."

Intel allows some third parties to interoperate with their system (ref Intel vs NVIDIA etc.) and they pick and choose to their liking, kill some and promote others exactly because they control the open-ness of their systems.

> http://www.anandtech.com/show/9792/amd-sc15-boltzmann-initia...

HIP is still not a CUDA compiler. http://gpuopen.com/compute-product/hip-convert-cuda-to-porta...

> nor did I refer to CUDA as a programming language.

You did refer to "CUDA compiler". My comment was admittedly a nitpick as well as a serious point too. CUDA can be seen as a C++ language + extensions -- something you can compile --, but it's also more than that (stuff that you can't compile), e.g.: API, programming tools, etc. all strongly adapted for NVIDIA hardware.

That article doesn't say AMD wanted to support CUDA, they wanted to give tools to migrate to HCC ("AMD's CUDA").

That would be one way, but what's commendable is that they went further, and HIP is actually also a common thin API on top of CUDA and their own platform. They could've just stopped at converting code, but they did not -- and that's something that might save them and give people enough incentive to support their products. You can keep your NVIDIA path that'll be compiled with the nvcc backend and target both platforms with nearly the same code, especially on host (and often also device side).

I don't wish you the suffering vendor lock-in can cause after 10 years (hell, even less) of faithfully following the NVIDIA path, but... actually I do, because that's probably the best way to realize what's wrong with proprietary systems that pitch themselves as "de-facto" standards.

Build your systems around GEMM/blas. Every vendor will give you a fast GEMM, and you'll be set for basically all the architectures that are coming out.

Except that not all problems in computation are GEMMs. CNNs in machine learning certainly are, but many 'real' systems cannot be posed in such a manner.

In supercomputing this is the problem with using high performance linpack for benchmarks, which typically exceeds actual scientific codes by an order of magnitude in terms of floating point operations per second.

Yes but to the extent you can, it's an easy win. I switched to a GEMMable method for a preprocessing step today based on the Volta and recent TPU news.
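To make "GEMMable" concrete, here is one textbook reformulation (not necessarily what the commenter did): pairwise squared Euclidean distances, where the expensive cross term becomes a single matrix multiply. The pure-Python `gemm` below is a stand-in for a vendor BLAS call, and all names are hypothetical.

```python
def gemm(a, b):
    """Plain matrix multiply; a stand-in for a vendor-optimized GEMM."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def pairwise_sq_dists(xs, ys):
    """Squared distances via ||x||^2 + ||y||^2 - 2 * (x . y)."""
    ys_t = [list(col) for col in zip(*ys)]   # transpose so X @ Y^T is a GEMM
    cross = gemm(xs, ys_t)                   # all dot products at once
    x_norms = [sum(v * v for v in x) for x in xs]
    y_norms = [sum(v * v for v in y) for y in ys]
    return [[x_norms[i] + y_norms[j] - 2 * cross[i][j]
             for j in range(len(ys))]
            for i in range(len(xs))]

def pairwise_sq_dists_direct(xs, ys):
    """Reference version with the obvious nested loops, for comparison."""
    return [[sum((p - q) ** 2 for p, q in zip(x, y)) for y in ys] for x in xs]

xs = [[0, 0], [1, 2]]
ys = [[3, 4], [1, 2]]
via_gemm = pairwise_sq_dists(xs, ys)
direct = pairwise_sq_dists_direct(xs, ys)
```

Both versions give the same answer; the difference is that the first one spends nearly all of its time inside the matrix multiply, which is the part every vendor makes fast.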

Hopefully Tensorflow XLA or other optimization frameworks could solve this problem in a more general way in the medium term:


I thought NVIDIA GPUs support OpenCL? Or do they not do that anymore?

It's always been 10-20% slower than CUDA and frankly NVIDIA doesn't have an incentive to make it faster than that.

On the other hand, I believe Google is working on a CUDA compiler [1] so we may actually see meaningful improvement in the sense that it may become possible to run CUDA on other GPUs. (Edit: And Google actually has an incentive to achieve performance parity, so it might really happen.)

[1]: https://research.google.com/pubs/pub45226.html

> On the other hand, I believe Google is working on a CUDA compiler [1]

Hi, I'm one of the developers of the open-source CUDA compiler.

It's not actually a separate compiler, despite what that paper says. It's just plain, vanilla, open-source clang. Download or build the latest version of clang, give it a CUDA file, and away you go. That's all there is to it.

In terms of compiling CUDA on other GPUs, that's not something I've worked on, but judging from the commits going by to clang and LLVM, other people are quite interested in making this work.

Interesting, appears to have been merged upstream:


But it still targets NVIDIA GPUs and uses NVIDIA libraries so not that universal yet.

"It's always been 10-20% slower than CUDA"

This is an untrue, yet often repeated statement. For example Hashcat migrated their CUDA code to OpenCL some time ago, with zero performance hits. What is true is that Nvidia's OpenCL stack is less mature than CUDA. But you can write OpenCL code that performs just as well as CUDA.

It has historically been slower for neural networks, especially considering the lack of a CuDNN equivalent.

Also the opposite can be true as well (>2x slower); e.g. try to rely heavily on shuffle.

What is Hashcat and why should we care?

A password cracking utility, and because it was put forth as at least one example of a real-world application purported to perform just as well under OpenCL as CUDA. If true, it provides evidence against the claim "[OpenCL]'s always been 10-20% slower than CUDA".

Because it's a performance-critical application that has made the switch, so it's a good comparison.

As someone said, we already merged it upstream. :)

Nowadays our CUDA compiler is just clang

10-20% slower seems an honest delta, I can't blame a company for working more on their desires/ideas if they provide a standardized non crippled solution.

> It's always been 10-20% slower than CUDA and frankly NVIDIA doesn't have an incentive to make it faster than that.

Incorrect. Our kernels (GROMACS molecular simulation package) are 2-3x slower implemented in OpenCL vs CUDA.

> On the other hand, I believe Google is working on a CUDA compiler

They were. It's upstream clang by now.

Can Vulkan fill that space?

Nvidia will have to support the SPIR-V Vulkan environment that is different to the OpenCL SPIR-V environment. But Vulkan is a graphics API not a compute API. Yes in theory you can write compute shaders but from my experience if you have a compute workload: use a compute API, they're much more suited for the job.

So, no.

I find it so cool that technology created to make games like Quake look pretty has ended up becoming a core foundation of high performance computing and AI.

I think it's even cooler how matrix multiplication dominates both the universe at large, and the systems that understand it (neural networks).

Well a large portion of that is desiring the data to be in that form. BLAS operations are ruthlessly efficient and use the system hardware so well.

Linear algebra is the ultimate common variable in technical computing and applied mathematics at large.

I find it incredible that even with all these cool applications of matrix multiplication it gets taught so horribly in schools.

Max Tegmark keeps saying interesting things lately. Controversial for sure, but interesting. His book 'Our Mathematical Universe' (related to the article you've linked) is thought-provoking, and I would label it a must read if you're interested in what's going on at the outer edges of fundamental science. The chapters are clearly separated into: fact, hypothesis, and far-out speculation, so there's no need to criticize the whole thing indiscriminately.

There was a series of attempts by Lee Smolin and others to come up with a theory of quantum gravity by assuming that the universe, at the bottom, is essentially simple and discrete (not in the fixed-grid sense, but in the sense of a discrete web of relations). That model also exhibits a remarkable similarity between the structure of the universe, and the structure of the neural networks that understand it.

The future of fundamental science is sure to be fascinating.

I think that's a dreadful article. We do know how neural networks work; they're a bunch of hierarchical probabilistic calculations that are pipelined. I don't really see how that couldn't work well; it's just hard to find the right probabilities. The difficulty is far more in the training than the working, and that's where the deep learning advances come in - inferring more parameters in a deeper hierarchy.

There's no relationship between a hierarchy of probabilistic estimations and a hierarchical decomposition of the cosmos. The cosmos forms an apparent hierarchy because of the rules that govern matter and the initial expansion of the universe. That a small number of parameters might be listed in describing both is neither here nor there. A small number of parameters describe the vectors in a font file. It doesn't follow that a typeface then has any relationship with my brain or the universe.

The article reads, to me, like this: neural networks are this cool hierarchy thing, the cosmos is this cool hierarchy thing, and both of these things have low Kolmogorov complexity, isn't it amazing that our brains are like this and can understand the universe, wow.

> a bunch of hierarchical probabilistic calculations that are pipelined

That's one way of describing quantum theory; generally "contextual" or "non-commuting" are used instead of "hierarchical".

If the universality of such a common framework doesn't seem profound to you, at least realise it isn't something generally appreciated and barely even hinted at just a few decades ago.

It's backwards: the first common application of the technology gave it its name.

Like in one of Stanisław Lem's stories about Ijon Tichy, where people call intelligent anthropomorphic robots washing machines.

You'll be delighted to hear that traffic signals are called "robots" in South Africa.

That itself reminded me that the word "robot" comes from "robota, rabota", which means "to work, worker" in a lot of Slavic languages (my native language is Bulgarian).

From the dictionary:

robot, origin: from Czech, from robota ‘forced labor.’ The term was coined in K. Čapek's play R.U.R. ‘Rossum's Universal Robots’ (1920).


Rossum is a riff on "rozum" which means reason (as in to reason about).


"Thinker's Universal Workers"?

Yet another step in the progression in which mass market GPU silicon kills traditional vector and memory-bandwidth-rich HPC/supercomputing hardware. Cray-on-a-chip.

Edit: traditional vector machines like the NEC SX still hold the programmability crown because you get a usable single system image, right?

Matrix multiplication is important for graphics and important for finding the weights of a neural network.

Yep, hard to imagine though that the original creators of the Nvidia TNT or Voodoo had any idea that GPUs would become fully programmable computing hardware used for non-graphical applications.

Creators of Voodoo (3dfx = Gary Tarolli, Scott Sellers) came from the world of fully programmable GPUs. Silicon Graphics workstations had full T&L since ~1988 (http://www.sgistuff.net/hardware/systems/iris3000.html).

The whole point of Voodoo 1 was making it as simple and cheap as possible by removing all the advanced features and calculating geometry/lighting on the CPU.


Iris Graphics Geometry Engines weren't programmable in the modern sense. There was a fixed pipeline of Matrix units, clippers and so on that fed the fixed function Raster Engines. You could change various state that went into the various stages, but the pipeline's operations were fixed.

Later SGI Geometry Engines used custom, very specialized DSP-like processors, but the microcode for those was written by SGI, and was not end-user programmable.

There were probably research systems before it, but AFAIK the GeForce 3 was the first (highly limited) programmable geometry processor that was generally commercially available.

Uhm, weren't their later graphics systems heavily based on i860 processors?

Yes, later REs were i860s.

I don't think they'd have been super surprised. Just pleasantly happy.

AI Accelerators have been a thing for decades - DSPs were used as neural network accelerators in the early 90s - and Cell processors were a thing by 2001.

GPUs just became vastly more accessible to general purpose programming in the last decade. People were doing it back in the 90s but it was seriously hard.

We finally hit a tipping point where it's just kinda hard.

There were also the various custom "systolic array" processor designs in the 1980s (the ALVINN vehicle, and earlier projects which led to it, used these for early neural-network based self-driving experiments).

I remember back in 2004 when I heard a fellow grad student was working on using GPUs as a co-processor for scientific computing, I thought "Wow, that's esoteric and niche."

This reminds me of a comment I read here ages ago about a scientist using the "processors" of the university's PostScript printers because they did the work much faster than their scientific workstations.

Reminds me of some Commodore 64 programs running code on the 1541 disk drive to offload computation from the main CPU (both the C64 and the 1541 had 6502s (well, the C64 had a 6510 which had an I/O port) running at 1 MHz). The original Apple LaserWriter had a 68k running at 12 MHz, while the Mac Plus, which came out almost a year later, had its 68k running at 8 MHz.

Wow, this is just Nvidia running laps around themselves at this point. Xeon Phi still not competitive, AMD focused on the consumer space, looks like the future of training hardware (and maybe even inferencing) belongs to Nvidia. (Disclosure: I am and have been long Nvidia since I found out cuDNN existed and how far ahead it was.)

> Xeon Phi still not competitive, AMD focused on the consumer space, looks like the future of training hardware (and maybe even inferencing) belongs to Nvidia.

Assuming there's a big future to training hardware and inferencing. Many of those "new paradigms" / "silver bullet technologies" have come and gone in the last decades.

That's true, but there is reason to believe this time is different™, with killer applications in medical image understanding, natural language understanding, and self driving cars, all of which could drive demand of these chips by themselves. It is possible we will discover new dominant architectures that don't use this hardware well but I am putting my money on us coming up with even more applications that do use this hardware well.

There's something coming for them: deep learning processors.

I'm biased, since I'm part of one, but there's little to no modification of the software stack necessary, so it's a credible threat to nvidia.

I hope so, if only because it keeps them running at this pace! Kudos for charging the 800lb gorilla head on.

What do you think about them open sourcing DLA of Xavier?

They haven't released enough info on it, what exactly are they open sourcing? The chip design?

I'm wondering myself. Maybe just the software to use it? No idea...

> I am and have been long Nvidia

Today was a great day to be!

The potential disrupter here is RISC-V with vector extensions, which are currently being standardized.

815 mm^2 die size!

That's at the reticle limit of TSMC, a truly absurd chip.

I agree... there's not much more they can do to scale since off die is still slow. Unless they stitch across the exposure boundary!

However, they have been at the reticle limit since they were in 28nm. GM200 (980 Ti and Titan X) was 601 mm^2 at TSMC... the maximum possible at the time.

I've seen some huge mainframe die back in the day. What is reticle limit exactly? Thanks for educating a SW guy :)

Part of the chipmaking process is burning layers into wafers covered in photosensitive material. Photomasks/reticles used to cover entire wafers, making many units at once, but now the processes are so small that they have to shrink the image (4-10 times is typical), burn a couple of units, step over, and repeat on the same wafer. This GPU is so large that they can only fit one of them in a single burn step.

It's something along the lines of the film size for the super fancy camera they use in one of the steps. (The silicon wafer would be the equivalent of the entire roll of film.)

193i immersion steppers, à la ASML, have 32x26 mm as the reticle limit.
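Plugging in the figures quoted in this subthread (815 mm^2 die, a 32x26 exposure field as stated just above, assumed here to be in millimetres) shows how close to the limit the chip is. This is an area-only check that ignores the die's aspect ratio and any scribe lines, so treat it as a rough sketch.

```python
DIE_AREA_MM2 = 815             # die size quoted in the thread
FIELD_W_MM, FIELD_H_MM = 32, 26  # field size as given in the comment above

field_area_mm2 = FIELD_W_MM * FIELD_H_MM
fraction_of_field = DIE_AREA_MM2 / field_area_mm2
dies_per_exposure = field_area_mm2 // DIE_AREA_MM2  # by area alone

fits_in_one_field = DIE_AREA_MM2 <= field_area_mm2
```

By area alone the die fills roughly 98% of one exposure field, which matches the "only one per burn step" description.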

This is odd for NVIDIA. They usually push out revised versions in the second year, not change the entire architecture to the new one.

Feels like they're feeling AMD breathing down their necks with their VEGA architecture, which should be very interesting.

AMD have also stepped up their game with ROCm which might take a chunk out of CUDA.

As I recall, Volta (3d memory) has been delayed multiple times due to supply and this is only a very limited release of their highest end hardware for deep learning all pegged for Q3/Q4 release. A field where they haven't really any competition.

Can't imagine we will be seeing any Volta GeForce cards released till next year.

Volta GeForce will come early 2018 likely with GDDR6 at this point.

I wonder if the individual lane PCs will pave the way for implementing some of Andy Glew's ideas for increased lane utilization in future revisions?


What are the silver boxes that line both sides of the card? Huge Capacitors?

Ferrite chokes, part of the power delivery system.

Inductors, not chokes. Part of the buck converter to create Vcore.

Why are they needed?

For the same reason as around any other CPU or GPU and lots and lots of other chips: buck converter, i.e. 12V 20A DC in, 1.2V 200A DC out.

It's part of a step down voltage regulator called a buck converter. The buck converter works by putting a pulse of energy into the inductor and stretching it out to lower the voltage. This creates the core voltage.
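The 12V/20A in, 1.2V/200A out figures from the reply above follow from the textbook relations for an ideal (lossless) buck converter: input power equals output power, and the duty cycle is Vout/Vin. A minimal sketch with a hypothetical helper name; real regulators have losses and use several phases, which is why a card has a whole row of inductors.

```python
def ideal_buck(v_in, i_in, v_out):
    """Ideal lossless buck converter: returns (duty cycle, power, output current)."""
    duty_cycle = v_out / v_in       # fraction of each period the switch is on
    power_w = v_in * i_in           # lossless, so power in equals power out
    i_out = power_w / v_out
    return duty_cycle, power_w, i_out

duty_cycle, power_w, i_out = ideal_buck(v_in=12.0, i_in=20.0, v_out=1.2)
```

With these inputs the switch is on for about a tenth of each cycle, and the output current comes out at about ten times the input current.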

To get rid of electrical noise.

im assuming chip draws yuuge power

You're not wrong. 300W, holy shit.

Time to play some games on it

I have a feeling eventually Nvidia will, like Intel, de-prioritize the consumer market in favor of the much more profitable server/machine learning market.

Gaming GPUs still more than 50% of revenue for NVIDIA:


And shrinking.

I've thought that, but the per-unit volume is huge. Every game console, phone, tablet, PC needs a GPU. Even low-end devices are expected to run games. That's billions of units, albeit at lower margins.

Most of those aren't NVIDIA, though.

Ironically, most of them actually use AMD's IP (the "Adreno" GPU, which is an anagram of "Radeon") that they sold off to Qualcomm in 2009. Which was yet another terrible call made by AMD management in that timeframe.

(although who knows if Adreno would have blown up in the same way if it had AMD mismanaging it)

Even more ironically, Adreno also used tile-based rendering that NVIDIA ended up adopting in the Maxwell architecture and AMD is adopting in the Vega architecture. It's a nice way to boost your power efficiency, which is critical to battery life in mobile devices.

Turns out since we're past Dennard scaling, packing more transistors on a chip now makes it hotter. So if you want it to go faster, you need to cut the power down in other ways. And thus, desktop GPUs are starting to look an awful lot like mobile GPUs...

(which is yet another reason why AMD's general-purpose compute-oriented GPU architectures are losing so badly in the desktop graphics market. RX 580 pulls twice the power of a GTX 1060 for the same performance...)

Only in furmark, for non power virus work loads it's ~20w more on average.

No, in gaming it's literally twice the power consumption of a GTX 1060.


Many other aftermarket 580s are similar. For a sense of perspective here, that's roughly the same amount of power as some aftermarket 290Xs used. Or roughly 60 watts more than a GTX 1080. And that's GPU-only, not a total system load.


Polaris 10 is a reasonably efficient chip when you don't push it too hard. AMD - and their AIB partners - are pushing it way, way too hard in a desperate attempt to eke out a 2% win over the 1060. It isn't worth a 50% increase in TDP to get an extra 8% performance.

(and unlike the RX 480 - there is no reference RX 580 design, it's a whole bunch of these crazy juiced-up cards)

Reference card (1060) vs overclocked card (580). Looking at multiple review sites, including international ones like Computerbase, PCGH.de etc., and comparing overclocked 1060 vs overclocked 480/580, the difference is ~50-60 watts.

Not good, but also not twice the power consumption...

I don't understand why AMD didn't use faster memory in the 580 like Nvidia did with the 1060 refresh. The 580 needs faster memory more than higher core clocks.

Let's hope AMD's return to tile-based rendering (used in Adreno) plus the other improvements help them get better at power consumption, just like Nvidia with Maxwell. But I don't expect much from Vega after AMD's GPUs of the last 3 years. Navi looks more promising, as it is probably the first GPU to be fully designed under Raja Koduri.

And practically every game console, phone, tablet and vast majority of PCs are running integrated GPUs. Integrated GPUs that are not nvidia. Unless NV gets into the licensing market, the growth potential for them seems somewhat limited.

Well, the Nintendo Switch is an NVIDIA tablet, so there's that. It's selling like hotcakes, if you can even find one.

Well, they are trying to get into CPU market since late 2000s to sell their own integrated chips.

I mean, at what point will we come full circle, going back to a "mainframe" where consumers don't really own/possess the computing power; rather, it sits in datacenters. Like, you play your game through a VM basically, and your personal computer is just an AWS instance...

GeForce Now[0] - a VM you connect to from your PC, install games and stream.

GeForce Now for SHIELD[1] - Different model, more like "netflix for games"

[0]: http://www.geforce.com/geforce-now

[1]: https://www.nvidia.com/en-us/shield/games/geforce-now/

Pretty much already there.

They've done quite well on the Nintendo Switch.


Not necessarily. So many of the improvements would anyway have a dual use, and it's not like their margins in the gaming/end user GPU business are razor thin. Moreover, the volume is probably immensely higher, so despite lower per-unit profit, they probably make it up in quantity. Won't be like this forever given the different speeds at which the 2 sectors grow, but it's gonna be some time before the roles are reversed.

Perhaps, but the desktop gaming market is still growing and is a huge part of NVIDIA's income.

Isn't that what this post was all about? Releasing brand new architecture on compute first seems to me pretty much like prioritizing compute market over consumers.

Citation needed for the "much more profitable" part.

Nah. This is NVIDIA. They will just continue to focus on both markets as long as they're kicking ass in them.

It can only play Crysis on 50% texture.

I know you're getting downvotes, but in the Keynote they showed a cinematic-quality live rendered "gaming demo" scene

For those wondering, this was (around) the 44 minute mark.

I was wondering if this will be used in supercomputers. Apparently yes:

> Summit is a supercomputer being developed by IBM for use at Oak Ridge National Laboratory.[1][2][3] The system will be powered by IBM's POWER9 CPUs and Nvidia Volta GPUs.


Summit is supposed to be finished in 2017, though. I'm quite surprised this is possible since the Volta architecture has only just now been announced.

The Summit contract was signed in November 2014: http://www.anandtech.com/show/8727/nvidia-ibm-supercomputers

Supercomputers have very long planning and development cycles. So do GPUs and CPUs. The contract specified chips that didn't yet exist (Volta and POWER9) as much more than codenames on a roadmap.

I'm really happy our startup didn't go all in on Tesla (Pascal architecture) yet. These look amazing.

I feel like every time I buy cards, Nvidia announces the successor with absurd improvements.

Yeah, I just sprung for a Titan Xp -- waiting for it to become obsolete next month.

Well, close to it already if you are looking at $$/comparable performance, with the 1080 Ti.

The Titan Xp (with lowercase p, as opposed to the Titan XP) came out after the 1080 Ti so I'm sure GP took the latter into consideration before making a decision...

Yep. I'm not sure it was worth the extra $$ for the extra specs just yet. We'll see when we SLI it.

The issue though is no memory sharing with the GTX/Titan line. If that were the case, I probably just would have sprung for two 1080Tis out the gate.

Definitely loving the eight 1080Tis they just fit in here though: http://www.velocitymicro.com/promagix-g480-high-performance-...

OTOH improvements in the mainstream segments seem to go slower: Mainstream cards are about twice as fast now as they were five years ago.

CUDA should run on both, right? Unless you're talking about shader assembly or hardware.

So when are the new AWS instances coming?

FTA: "GV100 supports up to 6 NVLink links at 25 GB/s for a total of 300 GB/s."

The math doesn't add up.

Bidirectional bandwidth.

Maybe 25 GB/s each way?

That's what I thought too, but then why would they quote unidirectional b/w in one part of the sentence, and bidirectional in the other?

Bandwidth of a single link (which is unidirectional) versus aggregate bandwidth of all links.
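Following the replies above, the numbers reconcile if 25 GB/s is the rate of one link in one direction and 300 GB/s is the aggregate over all links in both directions. This is the interpretation offered in the thread, sketched as arithmetic:

```python
LINKS = 6
GB_PER_S_PER_LINK_PER_DIRECTION = 25
DIRECTIONS = 2  # each link carries traffic both ways at once

one_way_total = LINKS * GB_PER_S_PER_LINK_PER_DIRECTION
aggregate_total = one_way_total * DIRECTIONS
```

So the naive reading (6 x 25) gives 150 GB/s, and doubling for both directions gives the quoted 300 GB/s.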

Interesting to note that Nvidia's stock rose about 18% (!, 102.94USD on May 9, 121.29USD on May 10) in a single day after this announcement. I expected the market to react, but this seems disproportionate.

They announced this the day after earnings, earnings caused the jump, this compounded (maybe).

My favorite outcome of Volta is that it's the first GPU they've produced that actually can claim this SIMT thing due to its separate program counters (we had a spirited debate about whether or not just doing masking but presenting the programming model meant the chip was SIMT or just that CUDA was but GPUs weren't).
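For readers unfamiliar with the masking model being debated, here is a toy sketch of it: one shared instruction stream, with an active mask deciding which lanes commit results on each side of a branch. It only illustrates the pre-Volta style of divergence handling; it does not model Volta's per-thread program counters, and the function name and branch bodies are made up.

```python
def run_warp_masked(values):
    """Run `v // 2 if v is even else 3 * v + 1` on every lane in lockstep.

    Returns (results, number of serialized passes over the warp).
    """
    mask = [v % 2 == 0 for v in values]   # lanes that take the "then" side
    out = list(values)
    passes = 0

    if any(mask):                 # "then" side: only masked-on lanes commit
        passes += 1
        for lane, active in enumerate(mask):
            if active:
                out[lane] = values[lane] // 2

    if not all(mask):             # "else" side: the remaining lanes commit
        passes += 1
        for lane, active in enumerate(mask):
            if not active:
                out[lane] = 3 * values[lane] + 1

    return out, passes

uniform = run_warp_masked([2, 4, 6, 8])    # every lane agrees: one pass
divergent = run_warp_masked([1, 2, 3, 4])  # lanes disagree: two passes
```

When the lanes disagree, both sides of the branch are executed one after the other, which is the cost the masking approach pays for sharing a single program counter.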

Does this architecture improve on 64-bit integer performance? Have any of the GPU manufacturers said anything about that? At some point it becomes a necessity for address calculations on large arrays.

"With independent, parallel integer and floating point datapaths, the Volta SM is also much more efficient on workloads with a mix of computation and addressing calculations"


Under "New SM" in "Key Features" section

But if you read the article it seems the integer units are int32, so not capable of 64-bit computations.
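If the integer units are indeed 32 bits wide, 64-bit address arithmetic can still be built from 32-bit operations plus a carry, at the cost of extra instructions. Below is a sketch of that general multi-word technique (whether and how Volta's compiler does this is not something the article states). The base address and index are hypothetical values.

```python
MASK32 = 0xFFFFFFFF

def split64(x):
    """Split a 64-bit value into (low 32 bits, high 32 bits)."""
    return x & MASK32, (x >> 32) & MASK32

def join64(lo, hi):
    """Recombine (low, high) 32-bit halves into one 64-bit value."""
    return (hi << 32) | lo

def add64_with_32bit_ops(a, b):
    """Add two 64-bit values given as (lo, hi) pairs, using only 32-bit adds."""
    a_lo, a_hi = a
    b_lo, b_hi = b
    lo = (a_lo + b_lo) & MASK32
    carry = 1 if lo < a_lo else 0          # unsigned wraparound means a carry
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi

base = 0x7_0000_0000        # hypothetical buffer base address
index = 3_000_000_000       # hypothetical element index into a large array
byte_offset = index * 8     # 8-byte elements

# The byte offset alone no longer fits in 32 bits.
offset_overflows_32 = (byte_offset & MASK32) != byte_offset

address = join64(*add64_with_32bit_ops(split64(base), split64(byte_offset)))
```

The result matches ordinary 64-bit addition, but each add costs two 32-bit operations plus the carry check, which is why native 64-bit integer throughput matters for large arrays.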

Did they communicate any release date and price during the show?

DGX-1 with Volta — $149k, Q3; DGX Home Station with Volta — $69k, Q3

Any information about when this architecture will make it onto Tesla or Quadro products available to "mass" market?

I think Jensen mentioned this would be available with OEMs Q4 onwards

How long until Tesla sues for trademark infringement? "from detecting lanes on the road to teaching autonomous cars to drive" makes it sound like there is an awful lot of overlap in product function.

I doubt anything like that would happen. While Tesla Motors was founded prior to the creation of the Tesla GPU architecture, there's not really any overlap - in fact, I wouldn't be surprised if Tesla Motors wasn't using something like this from NVidia:


As far as any overlap software-wise is concerned, while it isn't super clear what Tesla Motors is doing for their self-driving systems, based on what I've seen it seems like they are using only "basic" lane-detection and identification along with some other algorithmic vision-based systems. I'm not saying that's everything they are doing, just what I have seen released publicly on their vehicle platform.

NVidia, on the other hand, has been experimenting with using neural networks (deep learning CNNs specifically) to drive vehicles using only camera information:


This is actually a fun CNN to implement - I (and many others) implemented variations of it in the first term on Udacity's Self-Driving Car Engineer Nanodegree. We weren't told to do it this way, but I chose to do so after reviewing the various literature, plus it seemed like a challenge (and it was for me). Udacity supplied a simulator:


...and we wrote code in Python (Tensorflow and Keras) to train and drive the virtual car. For my part, I had set up my home workstation with CUDA so that Tensorflow would utilize my GPU (a lowly GTX 750 TI SC - though it seems like it might have a similar GPU capability as NVidia's Drive-PX system, based on what I've researched - a Mini-ITX mobo, a PCI-E slot riser, and a GTX 750 would make a decent low-end deep-learning platform for self-driving vehicle experiments, and cost a fraction of what the Drive-PX sells for).

Tesla Motors uses Tegra chips to power their console. So, nVidia is probably okay.

What hardware do you think Tesla is using...?
