
Nvidia CEO Introduces Nvidia Ampere Architecture, Nvidia A100 GPU - bcaulfield
https://blogs.nvidia.com/blog/2020/05/14/gtc-2020-keynote/
======
drewg123
I'm confused. Is there any relationship between the recent Ampere Arm64
servers
([https://news.ycombinator.com/item?id=22475036](https://news.ycombinator.com/item?id=22475036))
and Nvidia's "Ampere Architecture", or is it just a case of them using the
same name?

~~~
dijit
I don't like that people downvoted you for asking a question.

Whether or not someone thinks the question is stupid doesn't mean a downvote
is warranted (nor an upvote; answer the question and move on).

To answer though; it's just a coincidence, as you might already know Nvidia
uses famous scientists (especially in the field of electricity) as the names
of their microarchitectures.

* Volta (Alessandro Volta, inventor of the electric battery)

* Tesla (Nikola Tesla, pioneer of alternating current)

* Maxwell (James Clerk Maxwell, who formulated the classical theory of electromagnetic radiation)

* Pascal (Blaise Pascal, lots of science around "pressure", arguably his work led to the creation of vacuum tubes used in early computers)

Ampere (from André-Marie Ampère, after whom the unit of electric current, the
"amp", is named) is just another electrical scientist's name.

Coincidentally, a new company founded in 2017 decided it was a good name for
them too, hence the confusion.

~~~
bhouston
And even before that there was:

* GeForce (for Andrea Geforce, the first to use the color electric green)

* Riva (for Jose Riva, who discovered that you can use TNT to generate electricity.)

~~~
smabie
Who is Andrea Geforce? I can't find any information about him/her.

~~~
capableweb
Wikipedia tells the following about the Geforce name:

> The "GeForce" name originated from a contest held by Nvidia in early 1999
> called "Name That Chip". The company called out to the public to name the
> successor to the RIVA TNT2 line of graphics boards. There were over 12,000
> entries received and 7 winners received a RIVA TNT2 Ultra graphics card as a
> reward.[2][3]

-
[https://web.archive.org/web/20000608011648/http://www.nvidia...](https://web.archive.org/web/20000608011648/http://www.nvidia.com/namingcontest)

- [https://tweakers.net/nieuws/1967/nVidia-Name-that-chip-
conte...](https://tweakers.net/nieuws/1967/nVidia-Name-that-chip-contest.html)

So I'm not sure the origin is actually Andrea Geforce; I certainly can't find
any sources that confirm that.

------
tbenst
Important to remember the half-precision tensor core misrepresentations: the
8x improvement over fp32 claimed for ImageNet with tensor cores (V100) was
actually only 1.2-2x in practice [1,2]. Furthermore, there are major precision
issues with network architectures like variational autoencoders and many others.

We use V100s for Richardson-Lucy like deconvolutions for example, where we
have near-exact photon counts up to 10,000 per pixel. fp32 is sufficient, tf32
is not.
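
To make the precision point concrete (using numpy's float16 as a stand-in,
since tf32 keeps the same 10 stored mantissa bits as fp16; this is just an
illustration, not an Ampere benchmark):

    import numpy as np

    # fp16 (and tf32) carry an 11-bit significand (1 implicit + 10 stored bits),
    # so integers are only guaranteed exact up to 2**11 = 2048; above 8192 the
    # spacing between representable values is already 8.
    for count in (2047, 2049, 10000, 10001):
        print(count, "->", float(np.float16(count)))

    #  2047 -> 2047.0
    #  2049 -> 2048.0   (already off by one)
    # 10000 -> 10000.0  (happens to be a multiple of 8)
    # 10001 -> 10000.0  (quantized away)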

The V100 claimed 15 teraflops of FP32; the A100 claims 19.5 teraflops. For most
pytorch/tensorflow workflows out there, where FP32 dominates, this works out to
roughly a 30% improvement over the last generation, which is reasonable and
typical. FP64 does get a nice boost, though.

[1] [https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-
vs-v...](https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v100-vs-
titan-v-vs-1080-ti-benchmark/) [2]
[https://www.pugetsystems.com/labs/hpc/TensorFlow-
Performance...](https://www.pugetsystems.com/labs/hpc/TensorFlow-Performance-
with-1-4-GPUs----RTX-Titan-2080Ti-2080-2070-GTX-1660Ti-1070-1080Ti-and-
Titan-V-1386/)

~~~
XCSme
I am not that much into ML, just fiddled with it a bit, is tf32=fp16?

~~~
tbenst
Not quite, but close. “tf32” is 18 bits, with the same 8-bit exponent that fp32
has and the 10-bit mantissa of fp16. It’s the range of fp32 with the precision of fp16. It’s a shame
to see such unoriginality in new number representations. I’d much rather see
Posit hardware acceleration:
[https://web.stanford.edu/class/ee380/Abstracts/170201-slides...](https://web.stanford.edu/class/ee380/Abstracts/170201-slides.pdf)

~~~
rajnathani
TF32 is 19 bits, not 18 bits. There's an additional bit for sign.

[https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-prec...](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-
format/)
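
If you want to play with it: since tf32 keeps fp32's sign and exponent and only
the top 10 mantissa bits, you can approximate the rounding by masking off the 13
low bits of an fp32 value. A minimal numpy sketch (truncation here is just an
assumption for illustration; the actual hardware rounding mode may differ):

    import numpy as np

    def to_tf32(x):
        # Keep sign (1) + exponent (8) + top 10 mantissa bits = 19 bits,
        # zero out the 13 low-order mantissa bits.
        bits = np.array(x, dtype=np.float32).view(np.uint32)
        bits &= np.uint32(0xFFFFE000)
        return float(bits.view(np.float32))

    print(to_tf32(np.pi))         # 3.140625 (fp32 gives 3.1415927)
    print(to_tf32(1.0 + 2**-11))  # 1.0 (the extra bit falls below tf32 precision)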

------
xvilka
Probably even more closed than ever. They tend to become more and more
restrictive with every new hardware generation. I wonder what happened to the
open-source announcement they promised a while back.

~~~
wegs
Yeah. That's been my general problem with adopting NVidia for anything. They
make good hardware, but there's a lot of lock-in, and not a lot of
transparency. That introduces business risk.

I'm not in a position where I _need_ GPGPU, but if there wasn't that risk, and
generally there were mature, open standards, I'd definitely use it. The major
tipping point would be when libraries like Numpy support it natively, and better
yet, when Python can offload list comprehensions to a GPU. I think at that
point the floodgates will open, and NVidia's market share will explode from
specialized applications to everywhere.
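
For what it's worth, the NumPy side is already pretty close via CuPy, which
mirrors the NumPy API on CUDA. A minimal sketch, assuming CuPy and a
CUDA-capable GPU are available:

    import numpy as np
    import cupy as cp  # drop-in-ish NumPy API backed by CUDA

    x_cpu = np.random.rand(1_000_000).astype(np.float32)
    x_gpu = cp.asarray(x_cpu)            # copy the host array into GPU memory

    y_gpu = cp.sqrt(x_gpu) * 2.0 + 1.0   # elementwise math runs on the GPU
    total = float(y_gpu.sum())           # reductions too

    y_cpu = cp.asnumpy(y_gpu)            # copy back when you need NumPy again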

Intel stumbled into it by accident, but got it right with x86. Define an
open(ish) standard, and produce superior chips to that standard. Without AMD,
Cyrix, Via, and the other knock-offs, there would be no Intel at this point.

Intel keeps getting it right with numerical libraries. They're open. They work
well. They work on AMD. But because Intel is building them, Intel has that
slight bit of advantage. If Intel's open libraries are even 5% better on
Intel, that's a huge market edge.

~~~
fluffything
> They make good hardware, but there's a lot of lock-in, and not a lot of
> transparency.

This sounds like you'd like NVIDIA to open-source all their software. I see
this type of request a lot, but I don't see it happening.

NVIDIA's main competitive advantage over AMD and Intel is its software stack.
AMD could release a GPGPU twice as powerful tomorrow for half the price and most
current NVIDIA users wouldn't care, because what good is that if you can't
program it? AMD's software offering is just poor; of course they open-source
everything, they don't make any software worth buying.

ARM and Intel make great software (the Intel MKL and SVML libraries, the icc
and ifort compilers, ...), and Intel doesn't open-source any of that either,
for the same reasons as NVIDIA.

Intel and NVIDIA employ a lot of people to develop their software stacks.
These people probably aren't cheap. AMD's strategy is to save a lot of money
on software development, maybe hoping that the open-source community, or Intel
and NVIDIA, will do it for free.

I also see these requests that Intel and NVIDIA should open-source everything
together with the explanation that "I need this because I want to buy AMD
stuff". That, right there, is the reason why they don't do it.

You want to know why NVIDIA has 99% of the cloud GPGPU hardware market and AMD
1%? If you think $10,000 for a V100 is expensive, do the math on how much an
AMD MI50 costs: $5,000 for the hardware, plus a team of X engineers at >$100k
each (how much do you think AI GPGPU engineers cost?) working for N years just
to play catch-up on the part of the software stack that NVIDIA gives you for
free with a V100. That gets multiple millions of dollars more expensive really
quickly.

~~~
adev_
> AMD could release a GPGPU twice as powerful tomorrow for half the price and
> most current NVIDIA users wouldn't care, because what good is that if you
> can't program it?

Correction: nobody will be able to use the AMD hardware (outside of computer
graphics) because everybody has been locked in with CUDA on Nvidia. They cannot
change even if they want to: it is pure madness to rewrite an entire GPGPU
software stack every 2 years just to change your hardware provider.

And I think it will remain like that until NVidia gets sued for antitrust.

> ARM and Intel make great software [..] doesn't open-source any of that
> either for the same reasons as NVIDIA.

That's propaganda and it's wrong.

Intel and ARM contribute a lot to OSS. Most of the software they release
nowadays is open source. This includes compiler support, drivers, libraries
and entire dev environments: mkl-dnn, TBB, BLIS, ISPC, oneAPI, mbedTLS... ARM
even has an entire foundation dedicated to contributing to OSS
([https://www.linaro.org/](https://www.linaro.org/)).

Next to that, NVidia does close to nothing.

There is no justification for NVidia's attitude towards OSS. It reminds me of
Microsoft in its darkest days.

The only excuse I can see to this attitude is greed.

I hope at least they do not contaminate Mellanox with their toxic policies.
Mellanox has been an example of a successful open-source contributor/company
(up to now) with OpenFabrics
([https://www.openfabrics.org/](https://www.openfabrics.org/)). It would be a
real loss if that disappeared.

~~~
amelius
> Nobody will be able to use the AMD hardware (outside of computer graphics)
> because everybody has been locked-in with CUDA on Nvidia.

But numpy can be ported. So can pytorch.

I don't think the lock-in is that big of an issue. GPUs do only simple things,
but do them fast.

~~~
lumost
Part of Nvidia's advantage comes from building the hardware and software side
by side. No one was seriously tackling GPGPU until Nvidia created CUDA, and if
you look at the rest of the graphics stack, Nvidia is the one driving the big
innovations.

GPUs are sufficiently specialized in both interface and problem domain that
GPU enhanced software is unlikely to appear without a large vendor driving
development, and it would be tough for that vendor to fund application
development if there is no lock in on the chips.

Which leads to the real question: what business model would enable GPU/AI
software development without hardware lock-in? Game development has found a
viable business by charging game publishers.

~~~
diffrinse
Would you agree that your observations somewhat imply that a competitive free
market is not a fit for all governable domains (and don't mistake governable
for government here; we're still talking about the shepherding of innovation)?

~~~
fluffything
Early tech investments are risky, but if your competition has tech 10 years
more advanced than yours, there is probably no amount of money that would
allow you to catch up, surpass, and make enough profits to recover the
investment, mainly because you can't buy time, your competitor won't stop
innovating, they are making a profit and you aren't, etc.

So to me the main realization here is that in tech, if one competitor ends up
with tech that's 10 years more advanced than the competition, it is basically
a divergence-type of phenomenon. It isn't worth it for the competition to even
invest in trying to catch up, and you end up with a monopoly.

~~~
lumost
This is a good callout. Unlike manufacturing, the supply chain for large
software projects is almost universally vertically integrated. While it's
possible to make a kit car that at least some people would buy, most of the
big tech companies have reached the point of requiring hundreds of engineers
for years to compete.

The caveat is that time has shown these monopolies tend to decay for various
reasons; the tech world is littered with companies that grew too confident in
their monopoly:

* Cisco

* Microsoft Windows

* IBM

etc.

~~~
fluffything
The problem with vertically integrated technology is that if a huge advancement
appears at the lowest level of the stack, one that requires re-implementing the
whole stack, a new startup building things from scratch can overthrow a large
competitor, who would need to "throw" their stack away, or evolve it without
breaking backward compatibility, etc.

Once you have put a lot of money into a product, it is very hard to start a
new one from scratch and let the old one die.

------
ethbro
In my experience, Jensen Huang's keynotes are unprofessional in the best
possible way.

I remember thinking during an entire GTC presentation "Wait, _this_ guy is the
CEO?"

He seemed like an excited engineer who happened to stumble onto stage.

~~~
tasogare
> He seemed like an excited engineer

Is this a bad thing? I personally avoid presentations made by CEOs of big
corporations, as they're usually a bingo card of trendy buzzwords that, in the
end, have no meaning.

~~~
ethbro
Yes and no.

When a kludged together demo fails and there's an awkward moment, sometimes a
little more prep might be nice.

But on the other hand, I feel better about a company focused on doing actual
work rather than polishing demos.

~~~
dahart
Tech demos are notorious for failing on stage. I can’t even think of a well
known CEO who hasn’t had one happen. I don’t think this has a thing to do with
Jensen’s style.

------
2OEH8eoCRo0
Better breakdown of this architecture compared to previous architectures here:

[https://www.anandtech.com/show/15801/nvidia-announces-
ampere...](https://www.anandtech.com/show/15801/nvidia-announces-ampere-
architecture-and-a100-products)

~~~
wmf
Even more details: [https://devblogs.nvidia.com/nvidia-ampere-architecture-in-
de...](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)

------
QuixoticQuibit
If the demonstrated speed ups translate to real world performance, then I’m
truly blown away. Looks like Nvidia will be holding onto the AI crown a while
longer.

The only thing I wonder is how difficult it is to take advantage of some of
the new arch features, such as the TF32 format or the sparse tensor ops.

~~~
sk0g
Have you looked at Apex AMP? TF32 sounds like it's along similar lines, and the
PyTorch usage is a breeze.
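
For reference, the apex.amp pattern is only a few lines; a rough sketch
(assumes NVIDIA's apex package and a CUDA GPU; per NVIDIA's TF32 post, TF32 on
Ampere is supposed to be enabled by the math libraries without even this much):

    import torch
    from apex import amp  # NVIDIA Apex, installed separately from PyTorch

    model = torch.nn.Linear(128, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # "O1" patches whitelisted ops to run in fp16 while keeping fp32 master weights.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    x = torch.randn(64, 128, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")

    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()  # loss scaling guards against fp16 underflow
    optimizer.step()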

------
gok
The TensorFloat-32 results look really impressive, but yikes that is not a
good name. "TensorFloat" is extremely confusable with "TensorFlow," and it
would really more accurately be called a 19 bit format.

~~~
rrss
'brain floating point' is also bad, but no one cares because it's just
bfloat16. If this becomes popular, it will just be tf32 or tfloat32 or
something.

~~~
QuixoticQuibit
I never knew what the “b” in “bfloat” stood for in all these new DL chips…
until today. Man, that’s bad.

~~~
boulos
Disclosure: I work on Google Cloud.

I wouldn't worry about it. Looks like the Anandtech article [1] doesn't either
:)

> bfloat16, a format popularized by Intel

[1] [https://www.anandtech.com/show/15801/nvidia-announces-
ampere...](https://www.anandtech.com/show/15801/nvidia-announces-ampere-
architecture-and-a100-products)

------
frankchn
It seems like the TF32 format is similar to BF16 but with 3 more precision
bits (in other words, it is FP32 with 13 low-order bits dropped instead of
16).

The number of full adders in an FP multiplier scales with the square of the
mantissa width, so if the number of mantissa bits stays the same, they can
reuse the existing units.

Since Nvidia already has existing FP16 units on die, using those units for
TF32 calculations probably doesn't cost too much additional die area.

------
krapht
Oof. Double-precision is only 2.5x better, which is less impressive than 20x
for float.

I still haven't found anything more cost-effective for double precision data
processing (cost + dev time) than a rack full of used Xeons...

~~~
frankchn
The double-precision number probably best represents the generational
improvement.

The 20x 32-bit floating point improvement is probably achieved by comparing
doing full FP32 calculations on the previous generation vs doing TF32
calculations on Ampere. This would not be an apples-to-apples comparison, as
the TF32 result is less precise.

That said, it is probably not terribly important for deep learning at least,
given the success of BF16.

~~~
frankchn
Actually, it looks like the double-precision number for general GPU usage only
went up 25% (Volta did 7.8 TFLOPS, the A100 does 9.7). To get the 2.5x number,
you need to use FP64 in conjunction with the TensorCores, which gets you 19.5
TFLOPS.

Considering how big the die is (826mm^2 @ TSMC 7nm) and how many transistors
there are, they really must have beefed up the TensorCores much more than the
general compute units.

------
zelon88
I always thought that a real time scalable architecture would be beneficial.
It's refreshing to see someone working on it, and exciting to see that it's
nVidia. I always pictured a CPU with variable bit-width, like a 256-bit ALU
that could partition itself down into 16- or 32-bit ALUs as the workload
allowed.

~~~
wmf
That's been around since MMX and AltiVec. It took a while for GPUs to adopt
subword SIMD though.

~~~
zelon88
SIMD works great for doing the same thing to multiple pieces of data, but it
doesn't do the scaling up that I described.

I'm no chip engineer, so maybe what I'm envisioning isn't possible. In
essence, instead of making 4x 64-bit cores you make 128x 2-bit cores and then
some architecture on the die to select groups of cores to build a processor of
the required size, execute some instructions with that processor, and then
disassemble the processor back into a pool of resources.

So SIMD might be able to calculate two 16-bit sums on a 32 bit processor in
one cycle, but the hypothetical CPU I'm describing will be able to calculate a
single 128 bit sum and eight 16 bit sums in one cycle, at the same time.

~~~
magicalhippo
What you're describing is basically a modern FPGA[1]. You can wire it up as
you want at runtime, and they can contain specialized hardware like hardware
multipliers and fast local memory to accelerate certain workloads.

[1]: [https://en.wikipedia.org/wiki/Field-
programmable_gate_array](https://en.wikipedia.org/wiki/Field-
programmable_gate_array)

------
aloer
For those in the industry:

When a new generation like this is released, will a typical AI company replace
its current GPUs? Is there a chance to acquire the older ones for private use,
or is it too early for that?

~~~
minimaxir
A fun quirk of GPU pricing economics is that on Google Cloud Platform, the
relatively recent T4 (Turing-based, with FP16 support) is _cheaper_ than the
ancient K80s. [https://cloud.google.com/compute/gpus-
pricing](https://cloud.google.com/compute/gpus-pricing)

~~~
tedivm
That's not a great comparison at all, though. The K80 is a general-purpose chip
while the T4 is explicitly marketed as an inference chip. The K80 has more RAM
(super important for batch sizes during training), can access that RAM faster
(480 GB/s versus 320 GB/s), and is overall a more powerful chip than the T4.

~~~
minimaxir
Those metrics are for both GPUs on a board; the GCP K80 only uses one GPU, so
those performance metrics in theory would be halved (notably, the GCP K80 has
12 GB VRAM vs. T4's 16 GB VRAM), and it's still more expensive than a T4.

------
YetAnotherNick
7 times V100 performance for BERT. That is insane!

~~~
sk0g
Mind you, this is an optimised version of BERT, from what I could see in their
blog post.

~~~
shubuZ
The network topology is the same, no modules are removed, etc. Of course it
will be optimized to run on Ampere.

~~~
sk0g
I'll link to the mention of the optimised versions [0], but that's not what I
mean!

Say earlier model XYZ trained 4 epochs per hour, and BERT trained 2 epochs per
hour. Now on a single card if you can train _optimised_ BERT for 4 epochs per
hour, that doesn't necessarily mean the same card will handle XYZ at 8 epochs
per hour.

It's a technological achievement nonetheless, but the fact that it was heavily
optimised for the new architecture, possibly to an extent infeasible for
non-Nvidia developers, still has to be considered.

[0] [https://nvidianews.nvidia.com/news/nvidia-achieves-
breakthro...](https://nvidianews.nvidia.com/news/nvidia-achieves-
breakthroughs-in-language-understandingto-enable-real-time-conversational-ai)

------
throwawaysea
I know this isn't a traditional supercomputer and these aren't LINPACK
benchmarks, but I am still blown away by the 5 petaflops figure, considering
where we were just 10 or 15 years ago.

------
hank_z
Hmm, I was expecting 3080 ti.

~~~
DavidVoid
I believe the RTX 3000 series is expected to be announced in August. They
usually announce their GeForce cards a few months after they announce the
Quadro ones.

~~~
hank_z
Really? That's not too long to wait. Hopefully it won't get delayed by
Covid-19.

------
Keyframe
5 petaflops in DGX? That alone would put one of those babies into TOP500 top
50, and a superPOD would make no. 1, no? Well, if it could do that performance
on Linpack/Rmax.

~~~
sabalaba
No. The A100 has a 19.5 TFLOPS theoretical peak for SGEMM [1]; real-world
benchmarks will likely achieve about 93% of that, so the DGX A100 will be
roughly 145 TFLOPS of FP32 SGEMM performance, or 0.145 FP32 PFLOPS. Maybe
around 72 FP64 TFLOPS. FP64 is what the TOP500 benchmarks count. [2]

The 5 "petaflops" number is a creatively constructed marketing number based on
FP16 TensorCore "flops", sparse matrix calculations, and then multiplying by
8x for some reason. They basically take the 19.5 FP32 TFLOPS number and
multiply it by 32x to get to the claimed 624 "TFLOPS" for a single A100. 8 *
624 = 5 "petaflops". I see they get 2x by actually using FP16 instead of FP32,
2x from counting sparse matrix ops as dense ops, and 8x from somewhere else
that I have no idea.

[1] [https://devblogs.nvidia.com/nvidia-ampere-architecture-in-
de...](https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/)

[2] [https://www.top500.org/resources/frequently-asked-
questions/](https://www.top500.org/resources/frequently-asked-questions/)

~~~
rrss
> 8x from somewhere else that I have no idea.

8 GPUs in the box.

~~~
sabalaba
No, I already multiplied 624 TFLOPS / GPU * 8 GPU = 4992 TFLOPS (the 5
petaflops number).

I'm saying that you are still missing another 8x on the way from 19.5 TFLOPS /
GPU to 624 TFLOPS / GPU. 19.5 (base FP32 theoretical peak performance) * 2
(FP16 instead of FP32) * 2 (counting sparse matrix ops as dense ops) * 8
(unknown) = 624 TFLOPS.

~~~
rrss
FP16 tensorcore = 312 tflops

x 2 (counting sparse as dense) = 624 tflops

x 8 GPUs = 5 "pflops"

The missing 8x you are looking for is just because tensorcore math is much
faster than their normal fma path.
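
Writing out the arithmetic (the 312 figure is the dense FP16 tensor core peak
per GPU mentioned above):

    fp32_tflops = 19.5                     # per-GPU FP32 peak, non-tensor path
    fp16_tc_tflops = fp32_tflops * 16      # tensor cores: 312 TFLOPS dense FP16
    sparse_tflops = fp16_tc_tflops * 2     # 2:4 structured sparsity counted as 2x
    dgx_pflops = sparse_tflops * 8 / 1000  # 8 GPUs per DGX A100

    print(fp16_tc_tflops, sparse_tflops, dgx_pflops)  # 312.0 624.0 4.992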

------
tpmx
I'm mostly just amazed by that... what is it, that ivory thing above/around
his stove?

Is this a thing newly rich Americans have in their kitchens?

------
jayd16
No one is talking about it, but the mention of a focus on "datacenter
computing" (in the video) with CUDA is interesting. In retrospect, a GPU is
basically running a map/reduce-type workflow, so it's not that crazy after all.

Is this new, or is CUDA already being used as a distributed language?

~~~
massaman_yams
cuDF has supported Dask for distributed processing for a while now, maybe a
year or two?
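
The usage looks like regular Dask dataframes, just with cuDF partitions under
the hood. A minimal sketch, assuming dask_cudf is installed (file names and
columns here are made up):

    import dask_cudf  # Dask collections backed by cuDF (GPU) dataframes

    # Each partition is a cuDF dataframe living in GPU memory.
    ddf = dask_cudf.read_csv("events-*.csv")  # hypothetical input files

    # Lazy, distributed groupby; executed across workers/GPUs on .compute().
    result = ddf.groupby("user_id").amount.sum().compute()
    print(result.head())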

------
jzer0cool
Nvidia is one of the major GPU suppliers (along with AMD and others), and
compute needs in autonomous vehicles, robotics, etc. keep growing. Looking at
future demand, will there be more demand for CPUs (or a decrease in favor of
GPUs), for GPUs, or for some other, more favorable kind of processing unit?

Google announced TPUs a few years back as one example. So I'm curious which
market segments are likely to grow, and which processors would be needed to
fill the increased demand.

~~~
akelly
For dedicated neural network use cases like vehicles, robots, and datacenters,
dedicated accelerator chips are going to be huge. Tesla's been shipping custom
silicon for a while, I'd be surprised if Waymo cars didn't have a TPU in them,
and there's like a dozen companies making AI accelerators but as far as I
know, none of them are for sale yet.

------
arcanus
Extending the tensor Ops to FP64 is an interesting, if not surprising, design
choice. Are there many applications sure to leverage this capability? Aside
from HPL, of course.

~~~
blopeur
I am suspecting that this is specifically targeted at the HPC market as the
lack of FP64 has always been a hindrance to HPC deployment.

You have to remember that the HPC market is $35B today. HPE makes $3B a year
alone from that, maybe more with Cray acquisition.

So it's no surprise that NVIDIA wants to position themselves on that market.

Plus, you have too look at the long game with MLX acquisition ( heavy player
in the HPC market) and Cumulus. I wouldn't be surprised to see NVIDIA trying
to bypass Intel/AMD completely and offer Direct to Interconnect device. Rather
than the hybrid CPU + GPU box.

~~~
godelski
Remember too that AMD and Intel won the contracts for Aurora, Frontier, and El
Cap (the three exascale machines for the DOE). I imagine Mellanox is a big part
of getting the next contracts, especially given that a lot of these projects
are IO-bound, not compute-bound. If you can bring supercomputing-like abilities
to datacenters or AI labs, that'd be a huge advantage. If you could easily
split a huge model across 64 GPUs and train as if it were on a single node,
that would change the space.

~~~
fluffything
> Remember too that AMD and Intel won the contracts for Aurora, Frontier, and
> El Cap (the three exascale machines for the DOE).

This isn't very surprising I think. AMD cards often have higher FLOPs than
NVIDIA's, and I can imagine that they run HPL really well.

I can't wait to try these systems. I want to see what the OpenMP performance
there looks like for normal applications.

------
Thaxll
Is there a video or something?

~~~
KaoruAoiShiho
8 Part playlist:
[https://www.youtube.com/watch?list=PLZHnYvH1qtOZ2BSwG4CHmKSV...](https://www.youtube.com/watch?list=PLZHnYvH1qtOZ2BSwG4CHmKSVHxC2lyIPL&v=bOf2S7OzFEg&feature=emb_title)

------
baybal2
Have to admit again, Huang is an amazing salesman.

------
madengr
Did anyone notice that PCB? 50 pounds, 30k components, 1M drill holes, and 1
km of traces. Pretty much blew my mind.

------
macksd
The numbers for their SATURNV supercomputer are either untrue or absolutely
staggering. 4.6 exaflops? #1 on the Top 500 list of supercomputers just barely
passed 200 petaflops at peak performance. If you add up the entire list you
only get 1.65 exaflops. And LINPACK isn't usually network-bound. How can this
possibly be true?

~~~
KenoFischer
They're counting TF32, which is a 19-bit format, and comparing it to FP64.

~~~
arcanus
It's tensor Ops, and so the precision might be even lower than single. It
could be BF16, for example.

------
lustigmacher
Does anyone know what kind of GPU support for Spark is mentioned in this
announcement? Are they talking about existing XGBoost acceleration, or is it
something general-purpose?

~~~
zetazzed
This goes beyond the existing Spark+XGBoost GPU acceleration to include ETL,
Spark SQL, etc. Coming for Spark 3. Full details here:
[https://www.nvidia.com/en-us/deep-learning-
ai/solutions/data...](https://www.nvidia.com/en-us/deep-learning-
ai/solutions/data-science/apache-spark-3/)

------
twarge
They show a fictional render of the main chip surrounded by four large gold
coated leadless packages. What are those supposed to be?

~~~
wmf
VRMs.

------
liminal
How does RAPIDS relate to GOAI and Arrow? It seems like the same technology
keeps getting a name change...?

~~~
roaramburu
GoAi was about getting GPU developers on the same page and working together to
build an ecosystem for analytics on GPUs.

RAPIDS is a project that was born out of GoAi to bring that ecosystem to
Python.

It is built on Apache Arrow (although on GPU memory), and has many of the
original GoAi members like my team, BlazingSQL, and others such as Anaconda,
Nvidia, and many MANY others.

------
person_of_color
Still waiting for Wave Computing to ship.

~~~
trsohmers
Well considering they are filing for bankruptcy you are going to be waiting a
while...

~~~
rrss
Speaking of vapor, still waiting for rex computing to ship...

~~~
trsohmers
Me too; sadly we had silicon that ran great (we had better performance per
watt for FP32 and FP64 on a 28nm process vs. the A100's 7nm GFLOPS/watt), but
we were targeting a market that had 3 customers that didn't want to work with
a startup, plus investors that were opposed to us going into the "risky" and
"unproven" AI space. I still have 183 of the chips in my closet waiting to see
the light of day :/

------
lowdose
> DGX-A100 - The First HPC System With 140 Peta-OPs Compute Shipping Now For
> $199,000

Crazy pricing for a chip that basically started out as a gaming product. Has
anybody actually seen these Nvidia DGX systems in the wild?

~~~
jamesblonde
We are working with 20 of them at a customer....

~~~
thu2111
Can you talk about what sort of things you're doing, in general terms? Does
the new generation DGX look like a worthwhile upgrade for you?

~~~
jamesblonde
Large-scale deep learning. I don't make the decision on upgrades, but if I were
buying new DGX-1s, I would buy these new ones. Then again, I wouldn't buy a
DGX-1 in the first place - it is an appliance with software nobody wants. Buy a
commodity server with V100s and NVLink, like HPE or somebody else sells, for
66% of the price.

------
reedwolf
Stealing Epic Games/Unreal Engine's thunder.

If Nvidia was a human, they'd be the type to propose at someone else's
wedding.

~~~
notact
So wait, the GTC keynote, which was scheduled weeks/months ago, happens to
follow news that Epic dropped with absolutely no warning, and apparently that
counts as stealing thunder? Is that really what you are saying?

~~~
__alexs
Nvidia probably collaborated on the Epic demo, but the two are also aimed at
entirely different sets of customers. Nvidia's news cycle is data center
updates in May, consumer/gamer updates in September.

~~~
chupasaurus
Epic demo was running on PS5 so no NVIDIA involvement there.

~~~
__alexs
Oh yes, I forgot they've entirely lost this console generation.

~~~
paavohtl
Well, not entirely (depending on what you count as this console generation):
the Nintendo Switch is based on an NVidia SoC, although that's probably not
the biggest money maker.

