
IBM scientists demonstrate 10x faster large-scale machine learning using GPUs - brisance
https://www.ibm.com/blogs/research/2017/12/10x-faster-using-gpu/?utm_source=twitter&utm_medium=social&utm_campaign=AI&utm_content=nips2017
======
web007
> We can see that the scheme that uses sequential batching actually performs
> worse than the CPU alone, whereas the new approach using DuHL achieves a 10×
> speed-up over the CPU.

I had to get down to the graph to realize they're talking about SVM, not deep
learning.

This could be pretty cool. Training an SVM has usually been "load ALL the data
and go", and sequential implementations are almost non-existent. Even if this
were 1x or 0.5x speed and didn't require the entire dataset at once, it would
be a big win.
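
For a sense of what "sequential" could look like, here's a minimal sketch (not
IBM's method) using scikit-learn's SGDClassifier with hinge loss, which
approximates a linear SVM trained chunk by chunk; load_chunk and n_chunks are
hypothetical placeholders:

    # Out-of-core "linear SVM" via SGD on the hinge loss, fed one chunk at a time.
    # load_chunk() and n_chunks are made-up placeholders for whatever storage you use.
    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss="hinge")      # hinge loss ~ linear SVM objective
    classes = np.array([0, 1])             # partial_fit needs the label set up front

    for i in range(n_chunks):
        X, y = load_chunk(i)               # hypothetical loader for one chunk of data
        clf.partial_fit(X, y, classes=classes)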

~~~
jamesblonde
Yes, I felt cheated when I read it was about training 1/10th of ImageNet on an
SVM. I guess IBM are desperate not to be left behind in the race for
distributed deep learning platforms.

~~~
samfriedman
To be honest, I'd readily cheer any groups working on traditional machine
learning advancements despite all the current hype for neural methods.

~~~
madenine
I'll second that. For all the attention DL/ANNs get... there's still a lot of
legwork going on out there using linear models, basic trees, etc. IIRC this
year's Kaggle survey ranked logistic regression as the #1 most used model by a
long shot.

~~~
make3
Neural networks are stacked logistic regressions; a lot of deep learning
research benefits logistic regression.

------
dragontamer
I'd love to see more details.

Ultimately, it seems like IBM has managed to make a generalized gather/scatter
operation over large datasets in this particular task. Yes, this is an "old
problem", but at the same time, it's the kind of "engineering advancement" that
definitely deserves discussion. Any engineer who cares about performance will
want to know about memory optimization techniques.

As CPUs (and GPUs! And TPUs, and FPGAs, and whatever other accelerators
come out) get faster and faster, the memory-layout problem becomes more and
more important. CPUs, GPUs, etc. are all getting way faster than RAM,
and RAM simply isn't keeping up anymore.

A methodology to "properly" access memory sequentially has broad applicability
at EVERY level of the CPU or GPU cache hierarchy: from main memory to L3, L3 to
L2, and L2 to L1. The only place this "serialization" method won't apply is in
register space.

The "machine learning" buzzword is getting annoying IMO, but there's likely a
very useful thing to talk about here. I for one am excited to see the full
talk.

~~~
web007
They buried it, but their NIPS 2017 paper is linked in the article.

[https://arxiv.org/abs/1708.05357](https://arxiv.org/abs/1708.05357)

~~~
dragontamer
Thanks.

It does seem specific to machine learning / Tensors. But that's still cool.
I'll have to sit down and grok the paper more carefully to fully understand
what they're doing.

------
LolWolf
This is pretty fascinating! The concept seems to work only for convex
problems (in particular, problems that have strong duality; this excludes NNs
almost entirely, except one-layer nets), but the application is nice and
straightforward.

I wonder if a similar lower bound can be constructed for non-convex problems
that retains enough properties for this method to be useful?
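
For reference, the quantity they exploit (my notation, paraphrasing the linked
paper rather than quoting it) is the duality gap, which is only guaranteed to
be meaningful under strong duality:

    % Primal/dual pair for regularized ERM; the gap is nonnegative
    % and reaches zero only at the optimum (strong duality).
    \mathrm{gap}(w, \alpha) \;=\; P(w) - D(\alpha) \;\ge\; 0
    % If I'm reading the paper right, the gap decomposes over training examples,
    % and those per-example gaps are what get used as importance scores:
    \mathrm{gap}(w, \alpha) \;=\; \sum_{i=1}^{n} \mathrm{gap}_i(w, \alpha_i)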

------
panosv
How about doing the same on an 8- or 16-core CPU, which can have much more
than 16 GB of memory and where moving data around its own memory isn't as
expensive?

~~~
bitL
Roughly 1000x slower? GPUs nowadays have 5000+ "cores" inside.

~~~
dragontamer
> Roughly 1000x slower?

Not really. A modern Coffee Lake i7 has several distinct advantages over GPUs.
(AMD Ryzen also has similar advantages, but I'm gonna focus on Coffee Lake)

1. AVX2 (256-bit SIMD): for 32-bit ints / floats that's 8 operations per
cycle. AVX-512 exists (16 operations per cycle) but it's only on server
architectures. Also, AVX-512 has... issues... with the superscalar point #2
below. So I'm assuming AVX2 / 256-bit SIMD.

2. Superscalar execution: every Skylake i7 (and Coffee Lake by extension) has
THREE AVX ports (Port 0, Port 1, and Port 5). We're now up to 24 operations per
cycle in fully optimized code... although Skylake AVX2 can only do 16 fused
multiply-adds at a time per core.

3. Intel machines run at 4 GHz or so, maybe 3 GHz for some of the really high
core-count models. GPUs only run at 1.6 GHz or so. This effectively gives a 2x
to 2.5x multiplier.

So realistically, an Intel Coffee Lake core at full speed is roughly
equivalent to 32 GPU "cores" (8x from AVX2 SIMD, 2x or 3x from superscalar
execution, and 2x from clock speed). If we compare like with like, a $1000
Nvidia Titan X (Pascal) has 3584 cores, while a $1000 Intel i9-7900X
(Skylake-X) has 10 CPU cores (each of which can perform as well as 32 Nvidia
cores in fused multiply-add FLOPs).

So the i9-7900X is maybe 10x slower than an Nvidia Titan X when both are
pushed to their limits. At least, on paper.
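
Back-of-envelope with the numbers above (all approximate, counting a fused
multiply-add as 2 FLOPs):

    # Rough peak-FP32 estimate from the figures quoted above (nothing measured).
    cpu = 10 * 8 * 2 * 2 * 4.0e9   # 10 cores * 8 AVX2 lanes * 2 FMA ports * 2 FLOPs/FMA * ~4 GHz
    gpu = 3584 * 2 * 1.6e9         # 3584 CUDA cores * 2 FLOPs/FMA * ~1.6 GHz
    print(cpu / 1e12, gpu / 1e12)  # ~1.3 vs ~11.5 TFLOP/s, i.e. roughly a 9-10x gap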

And remember: CPUs can "act" like a GPU by using SIMD instructions such as
AVX2. GPUs cannot act like a CPU with regards to latency-bound tasks. So the
CPU / GPU split is way closer than what most people would expect.

-------------

A major advantage GPUs have is their "shared" memory (in CUDA) or "LDS" memory
(in OpenCL). CPUs have a rough equivalent in L1 cache, but GPUs also have L1
cache to work with. Based on what I've seen, GPU "cores" can all access shared
/ LDS memory every clock (if optimized perfectly: conflict-free accesses
across memory banks and whatever. Not easy to do, but it's possible).

But Intel Cores can only do ~2 accesses per clock to their L1 cache.

GPUs can execute atomic operations on the shared / LDS memory extremely
efficiently. So coordination and synchronization of "threads", as well as
memory movement to and from this shared region, are significantly faster than
anything the CPU can hope to accomplish.

A second major advantage is that GPUs often use GDDR5 or GDDR5X (or even HBM),
which is superior main memory. The Titan X has 480 GB/s (that's "big" B,
bytes) of main-memory bandwidth.

A quad-channel i9-7900X will only get ~82 GB/s when equipped with 4x DDR4-3200
RAM.
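
For context (my arithmetic, not from the article), the theoretical peak for
that memory config versus the Titan X figure:

    # Quad-channel DDR4-3200, 8 bytes per transfer per channel (theoretical peak).
    cpu_bw = 4 * 3200e6 * 8 / 1e9  # ~102 GB/s peak; ~82 GB/s is a realistic sustained number
    gpu_bw = 480.0                 # GB/s, Titan X (Pascal) spec quoted above
    print(gpu_bw / cpu_bw)         # ~4.7x on paper, closer to 6x against the sustained CPU figure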

GPUs have a memory advantage that CPUs cannot hope to beat. And IMO, that's
where their major practicality lies. The GPU architecture has a way harder
memory model to program for, but it's way more efficient to execute.

~~~
jamesblonde
Very good analysis, and a correct conclusion that memory bandwidth is the
bottleneck (at least for matrix fused multiply-add intensive workloads, like
feedforward NNs and convnets). We have done experiments on the 1080 Ti (484
GB/s), and for 32-bit FP training (convnets on TensorFlow) it is close in
performance to the P100 (717 GB/s).

The other point to add is that the SIMD execution model of GPUs is what gives
them efficient batched (coalesced) reads from GPU memory for each operation.

~~~
dragontamer
Thanks.

I can't say I'm an expert yet. But the more I read about highly optimized code
on any platform, the more I realize that 90% of the problem is dealing with
memory.

Virtually every optimization guide or highly-optimized code tutorial spends an
enormous amount of time discussing memory problems. It seems like memory
bandwidth is the singular thing that HPC coders think about the most.

------
yters
SVMs have better-understood generalization properties than NNs, so this is neat.

------
samnwa
How do I use this to mine bitcoin? K thanks.

------
WhitneyLand
tldr: They made a caching algorithm.

The article was touched by the PR dept, but still has actual information.

longer tldr:

They did the same thing that has been done for thousands of years. Back then
the hot area of research was how to stage advance caches of food and resources
along a route for long journeys. They came up with algorithms to optimize
cache hits.

In this case, the problem is that GPUs can be fast for ML, but usually have
only ~16 GB of RAM, while the dataset can be terabytes.

Simple chunk processing would seem to solve the problem, but it turns out the
overhead of CPU/GPU transfers badly degrades performance.

Their claim here is that they can determine on the fly how important different
samples are, and make sure samples that yield better results are in the cache
more often than those with less importance.
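
Roughly this shape, if I understand it (a toy sketch, not their actual
algorithm; in the paper the importance score is a per-example duality gap, and
compute_importance / train_on_gpu below are hypothetical):

    # Toy version of the idea: periodically score every training example by how
    # much it could still improve the model, and keep only the top-scoring ones
    # in the limited GPU memory for the next round of updates.
    import numpy as np

    def refresh_gpu_cache(scores, gpu_capacity):
        # scores: hypothetical per-example importance measure (e.g. a duality gap)
        return np.argsort(scores)[-gpu_capacity:]

    # scores = compute_importance(model, X)          # hypothetical scoring pass over all data
    # keep = refresh_gpu_cache(scores, gpu_capacity=100_000)
    # train_on_gpu(model, X[keep], y[keep])          # hypothetical inner loop on the GPU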

~~~
alexasmyths
To be fair - most innovation boils down to this kind of incremental stuff. 99%
of 'tech' is an amalgamation of more basic ideas, not 'magic leap' kind of
innovation.

I mean, we all love the magic, but I think we're getting spoiled as of late
with all the magic AI/Deep Learning stuff coming out.

~~~
inputcoffee
I agree with your first statement to the point of disagreeing with your
second. That is, even the magic stuff is just incremental progress that people
were not paying attention to. (Self-driving cars have been wowing people since
the '90s; object recognition just got incrementally better every year, etc.)

