
AMD has put an SSD on a graphics card - kungfudoi
http://www.techradar.com/news/computing-components/graphics-cards/amd-has-just-put-an-ssd-on-a-graphics-card-1325400
======
TheCartographer
Professional GIS user here. Most of my high-resolution terrain models are well
within the capacity of even a modest GPU to load into WebGL and run in a
browser without breaking a sweat. Terrain models are static by nature and
don't require a lot of horsepower once they are loaded.

The struggle comes in reading the data from storage, where several minutes can
be spent loading in a single high resolution raster for analysis/display. When
I built my own PC this year, I splurged on an M.2 SSD for the OS and my main
data store. Best decision I ever made for my workflow - huge 3D scenes that
formerly took minutes to load on a spinning platter now pop up in seconds.

This thing would probably be the bee's knees for what I do. Shame it starts at
$10k (and that it's AMD, so no CUDA and therefore no way to justify it at work
for "data science" :-/).

~~~
thescriptkiddie
Any idea why Nvidia dominates in "serious" GPGPU applications? I remember
people mocking them for refusing to adopt OpenCL, and when they finally caved
their implementation performed far worse than AMD's. How did they win people
over? Did they give out a bunch of free GPUs to universities or something?

~~~
kartD
CUDA mainly. It's fast (faster than OpenCL) and NVIDIA is really good with
their software. cuDNN for deep neural networks is almost an industry standard.
NVIDIA understands software and markets better, while AMD sits on their butts
for too long. Granted, AMD always comes out with a good open source solution,
but it's always just a bit worse and very late. NVIDIA tries to create markets
while AMD messes up and ends up a follower. Shame really.

Edit: this is a good step though. AMD should be pushing the envelope, and
hopefully with Zen they can actually realize some of the gains of HSA (which
they tried to pioneer, but it wasn't very useful since Bulldozer isn't that
good).

~~~
wyldfire
CUDA is not [unqualified] "faster than OpenCL". NVIDIA does design software
better than AMD, without a doubt. I think it's likely that NVIDIA decided to
push CUDA and let OpenCL support lag, betting that customers would balk at
having to port between the two. It's not that hard to port IMO; they're
extremely similar.

~~~
AIMunchkin
OpenCL IMO is an ugly API born of the equivalently ugly CUDA driver API
because Steve Jobs got butthurt at Jensen Huang for announcing a deal with
Apple prematurely. Downvote all you like, but as John Oliver would say "That's
just a fact." I witnessed it secondhand from within NVIDIA.

In contrast, OpenCL _could_ have been a wonderful vendor-independent solution
for mobile, but both Apple and Google conspired independently to make that
impossible (ironic in Apple's case because of OpenCL's origin story and
idiotic in the case of Google and its dreadful Renderscript, a glorified
reinvention of Ian Buck's Ph.D. thesis work, Brook).

Fortunately, AMD appears to have figured out that OpenCL has no desktop
traction, and they have embarked on building a CUDA compiler for AMD GPUs
under ROCm (Radeon Open Compute). They have also shown dramatically improved
performance on targeted deep learning benchmarks. It's early, but so is the
deep learning boom.

The wildcard for me is what Intel will decide to do next.

The big win IMO is vendor-unlocking all the OSS CUDA code out there.

[https://github.com/RadeonOpenCompute](https://github.com/RadeonOpenCompute)
[https://techaltar.com/amd-rx-480-gpu-review/2/](https://techaltar.com/amd-rx-480-gpu-review/2/)

~~~
wyldfire
Jobs might've been butthurt, and that might've added incentive, but nobody
likes to sole-source critical technology elements.

~~~
AIMunchkin
It would have been _fantastic_ if Intel had stopped beating the LINPACK horse
a lot sooner and built a viable competitor to GPUs by now. Not in this
timeline though, alas... Maybe 2020?

------
protomok
I'm having trouble seeing the value add here. From the AnandTech article [1],
they are using 2x 512GB Samsung 950 Pro SSDs, which use PCIe 3.0 x4 over M.2
connectors and a PCIe switch. The drives are presumably using NVMe.

The demo claims that without the SSDs they were rendering raw 8K video at 17
fps, and that using the SSDs boosted rendering to over 90 fps. How can this be
such a significant improvement over accessing the same SSDs connected directly
to a motherboard? The graphics card would have a PCIe 3.0 x16
connection... plenty of bandwidth and very low latency.

Maybe I'm missing something?

[1] - [http://www.anandtech.com/show/10518/amd-announces-radeon-pro...](http://www.anandtech.com/show/10518/amd-announces-radeon-pro-ssg-polaris-with-m2-ssds-onboard)

~~~
DigitalJack
The writer had some of the same thoughts, and found this:

"The performance differential was actually more than I expected; reading a
file from the SSG SSD array was over 4GB/sec, while reading that same file
from the system SSD was only averaging under 900MB/sec, which is lower than
what we know 950 Pro can do in sequential reads. After putting some thought
into it, I think AMD has hit upon the fact that most M.2 slots on motherboards
are routed through the system chipset rather than being directly attached to
the CPU. This not only adds another hop of latency, but it means crossing the
relatively narrow DMI 3.0 (~PCIe 3.0 x4) link that is shared with everything
else attached to the chipset."

~~~
protomok
Thanks, I missed this. I noticed that PCIe 3.0 x1 bandwidth is just above what
AMD reported in their throughput test - 985 MB/s... I wonder if AMD used a
system with the SSD connected to an M.2 slot on the motherboard with just a
PCIe 3.0 x1 link.
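
(For anyone checking the arithmetic, here's a rough sketch in Python. The
per-lane figure uses the standard PCIe 3.0 numbers - 8 GT/s with 128b/130b
encoding - and the ~900 MB/s is the figure AnandTech measured through the
chipset; nothing here is specific to AMD's implementation.)

    # Rough PCIe 3.0 bandwidth per link width (8 GT/s per lane, 128b/130b encoding).
    PCIE3_LANE_GBS = 8 * (128 / 130) / 8  # ~0.985 GB/s per lane

    for lanes in (1, 4, 16):
        print(f"PCIe 3.0 x{lanes}: ~{PCIE3_LANE_GBS * lanes:.2f} GB/s")

    # x1  ~0.98 GB/s  -> just above the ~900 MB/s AnandTech saw through the chipset
    # x4  ~3.94 GB/s  -> one 950 Pro's link, and also the shared DMI 3.0 uplink
    # x16 ~15.75 GB/s -> the graphics card's own slot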

I should clarify that I do think this onboard SSD concept could be really
compelling for certain use cases, such as needing to store several hundred
gigs of data which needs to be randomly accessed.

~~~
voltagex_
I wonder if this would be useful to companies like Pixar and Weta Digital - I
imagine a big speedup in frame rendering time or a reduction in number of
build machines required would be worth lots of cash.

------
nwah1
Surprisingly, this makes a lot of sense, and I'm glad that it uses an M.2
port. Thus, technically, you should be able to swap out the SSD for any
screaming fast model you want.

~~~
Chilinot
I can't find anything that supports your claim that it uses an M.2 port. But
even if it does, sadly the SSD will most likely be soldered directly on the
PCB like it is in most tablets and phones.

It would have been really cool to have been able to swap it out as needed, but
I don't think that will be doable :(

~~~
jsheard
Anandtech say it has two M.2 x4 ports:

[http://www.anandtech.com/show/10518/amd-announces-radeon-pro...](http://www.anandtech.com/show/10518/amd-announces-radeon-pro-ssg-polaris-with-m2-ssds-onboard)

And on the back you can see the standard tiered screw holes for fitting
different M.2 card lengths:

[https://twitter.com/FudzillaNews/status/757774661081899008](https://twitter.com/FudzillaNews/status/757774661081899008)

------
phamilton
I used to do research on gene sequencing on a GPU. For small sets it was quite
fast (it's arguably an O(n^4) algorithm, though really O(n^2 m^2)), but once
you couldn't fit the data set on the GPU it was dead slow.

This would solve that problem nicely.

~~~
wyldfire
Well, maybe. It would certainly help. But you probably still won't get
anywhere near GDDR5 (etc.) speeds from those memory accesses. Depending on the
memory access pattern, it might end up dropping down significantly. You might
still end up considering it "dead slow."

~~~
kartD
I think the main advantage is the GPU doesn't have to go through the CPU to
get data.

i.e. the transaction goes from this: GPU -> CPU -> HDD -> GPU

to this: GPU -> HDD -> GPU

~~~
wyldfire
Sure, that part was clear. But the difference between "quite fast" for _this
application_ and "dead slow" might be that hundreds of GB/s from GDDRx/HBM is
so much faster than anything else. Even the M.2 SSDs onboard are probably only
~5 GB/s. So if the walk pattern is predictable enough by the GPU (usually
pretty linear, but some access patterns have 2D/3D locality), then you could
crunch those numbers as fast as the SSD could deliver them. Now it becomes a
question of how much computation/kernel time you can spend on a chunk of data.
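
(To put very rough numbers on that question - the ~5 GB/s SSD figure is the
guess above, and the ~500 GB/s is a hypothetical ballpark for GDDR5X/HBM-class
memory, not a spec for this card:)

    # How long a kernel can spend per chunk before storage stops being the bottleneck.
    CHUNK_MB = 256

    def ms_to_stream(bandwidth_gbs, chunk_mb=CHUNK_MB):
        """Milliseconds needed to deliver one chunk at the given bandwidth."""
        return chunk_mb / (bandwidth_gbs * 1024) * 1000

    ssd_ms = ms_to_stream(5)    # ~50 ms per 256 MB chunk from the on-card SSDs
    hbm_ms = ms_to_stream(500)  # ~0.5 ms per 256 MB chunk from GDDR/HBM

    # If the kernel needs more than ~50 ms of compute per 256 MB chunk, the SSD
    # keeps up and the run stays compute-bound; if it needs less, the SSD is the limit.
    print(f"SSD: {ssd_ms:.0f} ms/chunk, HBM: {hbm_ms:.1f} ms/chunk")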

------
banachtarski
3 predictions:

1. GPU-driven command buffer workloads will become more significant (now that
literally all your textures can exist in VRAM, it's worth going the extra mile
to elide a GPU-CPU round trip).

2. Voxel-based techniques (which were generally memory intensive) open up
again. That means modeling destructible environments, atmospheric effects,
transparent materials, etc. in a more accurate and performant way.

3. Entire scene graphs can live on the GPU, which opens up a lot of design
space for new volume hierarchies and data structures.

------
redtuesday
Here is a video that shows the GPU in action:
[https://amp.twimg.com/v/b080f46d-e408-4348-864e-7cc12e40d1ac](https://amp.twimg.com/v/b080f46d-e408-4348-864e-7cc12e40d1ac)

------
ethagknight
How long would an SSD last under a graphics card workload?

~~~
sp332
It comes with a 10-year warranty, so it might not even matter. But back in
2014, 240GB drives were approaching a petabyte of continuous writes before
failing. [https://techreport.com/review/27909/the-ssd-endurance-experi...](https://techreport.com/review/27909/the-ssd-endurance-experiment-theyre-all-dead) With a couple more years of manufacturing and wear-leveling
improvements, and more space for the algorithms to play with, it could be a
lot more now.

~~~
mankash666
On the contrary, SSD lifespans have actually decreased. Smaller transistors
considerably reduce lifespan, and the advent of storing 2 (MLC) and 3 (TLC)
bits per cell has hastened the trend.

3D-NAND uses different physics, but as of now it hasn't yet matched legacy
NAND's endurance (lifespan).

~~~
Dylan16807
> 3D-NAND uses different physics, but as of now hasn't yet matched legacy
> NAND's endurance (lifespan)

What makes you say that?

~~~
mankash666
Facts make me say that.

3D-NAND uses charge-trap flash, while planar NAND uses floating-gate-
transistor-based flash.

Working in a NAND company also makes me say that.

~~~
Dylan16807
By "legacy" do you mean particularly old flash, or planar floating-gate flash
in general? Because samsung apparently claims that they can do 3D _TLC_ with
better reliability and density than 10nm planar _MLC_.

------
Sylos
So, what you're telling me is that I can now build a complete computer into my
computer as a graphics card?

~~~
wyldfire
Depending on your perspective, your computer has contained several "computers"
for many generations. Storage, network, display, and other adapters often have
something like a CPU onboard capable of executing software instructions.

------
pella
[http://www.anandtech.com/show/10518/amd-announces-radeon-pro...](http://www.anandtech.com/show/10518/amd-announces-radeon-pro-ssg-polaris-with-m2-ssds-onboard)

------
jamesfmilne
This 8K video demo is stupid.

Let's assume a relatively modest 8 bits per component, stored as 4 bytes per
pixel: a 7680x4320 frame is ~132MB, which at 24 fps is ~3.2 GB/s.

So in 1TB you can store about 312 seconds, i.e. roughly 5 minutes, of video.

You're going to have to top this up to play any more than 5 minutes, which
means writing to the SSD at 3.2 GB/s whilst you read from it. I don't believe
any NVMe SSDs are full duplex, so that's 6.4 GB/s, which the SSDs cannot
provide.
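
(A quick way to re-check those numbers, assuming 4 bytes per pixel as the
132MB figure implies:)

    # Back-of-envelope check of the 8K playback numbers above (4 bytes/pixel assumed).
    width, height, bytes_per_pixel, fps = 7680, 4320, 4, 24

    frame_mb = width * height * bytes_per_pixel / 1e6   # ~132.7 MB per frame
    stream_gbs = frame_mb * fps / 1e3                   # ~3.2 GB/s sustained read
    seconds_per_tb = 1e12 / (frame_mb * 1e6 * fps)      # ~314 s, i.e. roughly 5 minutes

    print(f"{frame_mb:.0f} MB/frame, {stream_gbs:.1f} GB/s at {fps} fps, "
          f"{seconds_per_tb / 60:.1f} min per TB")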

You can DMA to GPU memory at 10 GB/s, so the SSDs are of no benefit for
video.

No-one stores 8K data uncompressed anyway, so the CPU will need to read and
decompress the data.

Perhaps it's useful for faulting in megatextures for 3D scenes though.

~~~
sigi45
It shows the bandwidth bottleneck nicely.

------
mankash666
Sounds like an excellent setup for machine learning / big-data workloads.

People have already moved database computation to the GPU; providing low-
latency NVM access to such a setup would really accelerate things.

------
c2h5oh
This might be useful for consumer cards as well: a tiny (0.25-2x VRAM)
compiled-shader or texture cache could possibly help with framerate dips
(microstutters).

I know that PCIe bandwidth is not really an issue and that SSD latency is
significantly higher than RAM's (but with NVMe pushing below 3 µs, it makes
you wonder how low a custom interface could go...)

------
theandrewbailey
> The graphics card has already been demonstrated rendering a raw 8K video –
> the initial demo showed this running at 17 frames per second, and switching
> to the SSG, that was boosted to over 90 frames per second.

I doubt that the same GPU was used. If it was, we're in trouble (or have
unoptimized software). I thought someone would have already solved the "let's
stream data in and out and not interrupt whatever the GPU is doing" problem,
but meanwhile many people are fretting over which PCIe version (and thus which
speeds) they can use.

~~~
wyldfire
Indeed, modern GPUs do copies over the bus while executing shaders/kernels.
But once you saturate the bus, you are limited in how much shader execution
you can accomplish as a result. If this scene required more textures than fit
in memory, it would benefit from having a local, directly accessible cache
instead of having to grab them over PCIe.

~~~
theandrewbailey
I agree, but...

A GPU's 16 lanes of PCIe 3.0 provide almost 16 GB/s of bandwidth. An M.2 SSD
(as used in this card) is at best 4 GB/s (PCIe 3.0 x4). This card has two M.2
slots.

Even at those speeds, accessing data from system RAM is faster (it depends on
specifics, but system RAM is easily 40+ GB/s). How does increasing data
bandwidth by 50% increase performance by 4x? Are inefficiencies (and
latencies) really eating up all that?
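
(Putting the figures side by side - the link numbers are theoretical PCIe
limits and the speedup is from the demo, so treat this as a rough sketch:)

    # Advertised link speeds vs. the speedup AMD demonstrated.
    pcie_x16_gbs = 15.75         # the GPU's own slot, PCIe 3.0 x16 (theoretical)
    oncard_m2_gbs = 2 * 3.94     # two on-card M.2 x4 SSDs (theoretical aggregate)
    system_ram_gbs = 40          # typical dual-channel DDR4, give or take

    observed_speedup = 90 / 17   # ~5.3x in the 8K demo

    print(f"x16: {pcie_x16_gbs} GB/s, on-card SSDs: {oncard_m2_gbs:.1f} GB/s, "
          f"RAM: {system_ram_gbs}+ GB/s, demo speedup: {observed_speedup:.1f}x")
    # The raw link speeds alone don't explain a ~5x jump, which is why the
    # chipset/DMI routing mentioned upthread (~900 MB/s observed vs ~4 GB/s
    # on-card) seems like the more plausible culprit.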

------
mankash666
What if the next PSP or Xbox used this architecture? Games could really fly!

------
ungzd
Instead of fixing a bus that's too slow for the SSD (is it really too slow?),
they've made this hack with duct tape. And it surely has a proprietary API and
direct addressing (no filesystem), like in the '50s.

Next they'll integrate Ethernet into the video card for high-speed traders
(which can also be marketed as lower latency for gaming). But why not
integrate a whole PC into it? With a gaming-console-style OS on it.

~~~
foldor
That is certainly a lot of leaps you're taking there. There's no evidence to
suggest any of those things you've implied.

------
FatalBaboon
Is it only me who's bugged that, with the addition of cold storage on the
card, GPUs scream of poor design?

If I were rewriting the same-ish REST helper in every microservice of a
project, I would immediately think of refactoring, and yet GPUs now have their
own CPU/RAM/disk.

Is the future GPU-only, or are we in dire need of a better way to use existing
PC components from GPUs?

~~~
Athas
I don't understand. What does hardware design have to do with refactoring,
usually a software term? The addition of the SSD is motivated by basic
physics, which creates the need for computational units and storage units to
be close together. (The alternative would be a non-von Neumann model without
this compute-storage distinction, but that's not so easy.)

------
tracker1
With UHD and higher-resolution monitors coming down the pipeline, something
like this could be a game changer even for gaming graphics... right now, the
GTX 1080 can't consistently hit 60 fps in newer games at 4K. Something like
this could improve that a bit. Even the technique of adding, say, 64-128GB of
secondary storage soldered onto the graphics board could be significant.

~~~
floatboth
Nah, storage is not a problem for gaming these days!

Wait for the Vega cards with 8GB of HBM2 — that will kick ass.

