
Smart Algorithms beat Hardware Acceleration for Large-Scale Deep Learning - rkwasny
https://arxiv.org/abs/1903.03129
======
comicjk
Important to note that the datasets used are sparse, and that the key to this
algorithm is better exploitation of sparsity. The GPU over CPU advantage is a
lot lower if you need sparse operations, even with conventional algorithms.

"It should be noted that these datasets are very sparse, e.g., Delicious
dataset has only 75 non-zeros on an average for input fea- tures, and hence
the advantage of GPU over CPU is not always noticeable."

In other words, they got a good speedup on their problem, but it might not
apply to your problem.

~~~
thesz
WaveNet, if I remember correctly, uses a 1-of-256 encoding of its input
features, and a 1-of-256 encoding of its output features.

It is extremely sparse.

If you look at language modeling, things there are even sparser: a typical
neural language model uses a 1-of-several-hundred-thousand encoding for the
full vocabulary (for Russian, for example, that is in the range of 700K..1.2M
words, and it is much worse for Finnish and German) and a
1-of-a-couple-of-tens-of-thousands encoding for byte-pair-encoded text (most
languages have an encoding that reduces the token count to about 16K distinct
tokens; see [1] for such an example).

[1] [https://bellard.org/nncp/](https://bellard.org/nncp/)

The image classification task also has sparsity at the output and, if you
implement it as an RNN, sparsity at the input (1-of-256 encoding of
intensities).

Heck, you can engineer your features to be sparse if you want to.
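
To make the sparsity payoff concrete, here is a toy C++ sketch (mine, not the
paper's): with a 1-of-N input, multiplying by the first layer's weight matrix
degenerates into picking a single row, so the other N-1 rows are never touched.

    #include <cstddef>
    #include <vector>
    
    // First layer applied to a 1-of-N (one-hot) input. W is stored with one
    // row per input feature, so the full product y = x^T W costs
    // O(N * hidden), but with a one-hot x it collapses to copying the single
    // active row.
    std::vector<float> forward_one_hot(
            const std::vector<std::vector<float>>& W,
            std::size_t active_feature) {
        return W[active_feature];  // y = e_i^T W = i-th row of W
    }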

I also think that this paper is an example of "if you do not compute, you do
not have to pay for it", just like in the GNU grep case [2].

[2] [https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html](https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html)

Given all that, I think this is a paper about a combination of very clever
things that work together to give excellent results.

~~~
zeroxfe
> If you look at language modeling, things there are even sparser: a typical
> neural language model uses a 1-of-several-hundred-thousand encoding for the
> full vocabulary

Most real-world models don't use one-hot encodings of words -- they use
embeddings instead, which are very dense vector representations of words.
Besides not blowing out GPU memory, embeddings are also semantically encoded,
so similar words cluster together.

~~~
thesz
First, you need to compute these embeddings at least once -- sparsity, there
you are! Second, these embeddings may differ between tasks, and the accuracy
you get from them may differ too.

For example, the embeddings produced by the CBOW and skipgram word2vec models
are strikingly different in the cosine-similarity sense -- different _classes_
of words end up similar under CBOW versus skipgram.

------
tyingq
The best layperson's summary I could find:

 _" SLIDE doesn’t need GPUs because it takes a fundamentally different
approach to deep learning. The standard “back-propagation” training technique
for deep neural networks requires matrix multiplication, an ideal workload for
GPUs. With SLIDE, Shrivastava, Chen and Medini turned neural network training
into a search problem that could instead be solved with hash tables"_

[https://insidehpc.com/2020/03/slide-algorithm-for-training-deep-neural-nets-faster-on-cpus-than-gpus/](https://insidehpc.com/2020/03/slide-algorithm-for-training-deep-neural-nets-faster-on-cpus-than-gpus/)
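
A rough sketch of that search-problem framing in C++ (a SimHash-style toy of
my own; SLIDE's actual hash families and table maintenance are more involved):
hash the input activation, look up the bucket, and run the layer only over the
neuron IDs found there.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>
    
    // Toy SimHash: one signature bit per random hyperplane (at most 64).
    // This only illustrates the idea of replacing "compute every neuron"
    // with "look up the likely-active neurons".
    struct SimHash {
        std::vector<std::vector<float>> planes;  // K random hyperplanes
    
        uint64_t signature(const std::vector<float>& x) const {
            uint64_t sig = 0;
            for (std::size_t k = 0; k < planes.size(); ++k) {
                float dot = 0.0f;
                for (std::size_t j = 0; j < x.size(); ++j)
                    dot += planes[k][j] * x[j];
                if (dot > 0.0f) sig |= (1ull << k);
            }
            return sig;
        }
    };
    
    // signature -> IDs of neurons whose weight vectors hash to that bucket
    using HashTable = std::unordered_map<uint64_t, std::vector<int>>;
    
    // Instead of dot products against every neuron in the layer, retrieve
    // only the neurons colliding with the input, i.e. those likely to have
    // large activations; the forward/backward pass then touches only these.
    std::vector<int> select_active_neurons(const HashTable& table,
                                           const SimHash& h,
                                           const std::vector<float>& x) {
        auto it = table.find(h.signature(x));
        return it == table.end() ? std::vector<int>{} : it->second;
    }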

~~~
ganzuul
Seems this is related to an adaptive approach that GPUs don't have support
for, but could be made to support. I think this means the next version of TPUs
will support it, and then GPUs will follow closely after.

~~~
yvdriess
No, their approach changes the fundamental access pattern into something
anathema to GPU and TPU architectures.

In ELI5 or layman's terms: current GPU/TPU accelerators are specialized in
doing very regular and predictable calculations very fast. In deep learning, a
lot of those predictable calculations are not needed, like multiplying by
zero. This approach leverages that and only does the minimal necessary
calculations, but that makes the work very irregular and unpredictable.
Regular CPUs are better suited to that kind of irregular computation, because
most other general-purpose software looks like that as well.
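
Concretely (toy code of mine, not SLIDE's): a dot product against a sparse
input only ever visits the non-zero entries, and the resulting access pattern
into the weights is exactly the irregular kind CPUs tolerate better.

    #include <utility>
    #include <vector>
    
    // Sparse dot product: the input is stored as (index, value) pairs, so
    // the multiplications against the implicit zeros are never issued at
    // all. The jumps through `weights` are data-dependent and irregular --
    // cheap on a CPU, hostile to the lockstep execution model of a GPU/TPU.
    float sparse_dot(const std::vector<std::pair<int, float>>& input,
                     const std::vector<float>& weights) {
        float sum = 0.0f;
        for (const auto& [i, v] : input) sum += v * weights[i];
        return sum;
    }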

~~~
numpad0
In layman's response, that sounds like the network could use normalization.

~~~
Rannath
"Simplify", if you want a layman's term for normalization.

------
nabla9
They don't implement SLIDE on GPU, so we don't actually know whether the CPU
is faster than the GPU. It's a comparison of SLIDE on CPU vs. softmax and
sampled softmax on GPU.

They should at least give a rationale for why a GPU would not speed up a
locality-sensitive-hash-based algorithm. GPUs are used to compute hashes fast
(they were used in Bitcoin mining once).

It's Intel sponsored research, but come on.

~~~
yvdriess
Since when is calculating the hash the bottleneck in hash table access?

The bottleneck is:

    ptr = h(x);     // trivial, ~0.5-2 cycles
    bucket = *ptr;  // --> ~200-300 cycles

edit: reading the paper, the buckets store pointers to the data, so I have to
add the following as well:

    el_ptr = bucket[i];
    el = *el_ptr;

So that's two dependent random-access loads.
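
To see why the dependence matters, here's a toy C++ chase loop (mine, not from
the paper): each load's address comes from the previous load, so the misses
cannot overlap. With `next` holding a random permutation larger than the
last-level cache, every step pays a full DRAM round trip.

    #include <cstddef>
    #include <vector>
    
    // Pointer chasing: the address of each load depends on the previous
    // load's result, so every ~200-300-cycle miss is paid serially instead
    // of being pipelined.
    std::size_t chase(const std::vector<std::size_t>& next,
                      std::size_t start, std::size_t steps) {
        std::size_t i = start;
        for (std::size_t s = 0; s < steps; ++s)
            i = next[i];  // next address unknown until this load completes
        return i;
    }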

~~~
johnlorentzson
Forgive me if this should be obvious, but why would a simple read from a
pointer take so many cycles?

~~~
erikmolin
Because you have to fetch it from RAM, unless the problem is small enough to
fit in cache.

~~~
johnlorentzson
Ah, right. I don't know how I forgot about cache and RAM.

------
mrb
As someone who has ported memory-bound workloads to GPU, I'd say SLIDE looks
like it would run even faster on GPU than on CPU. I skimmed the paper, and
SLIDE is described as a memory-bound workload, specifically random memory
accesses to 2-10GB of data. GPUs excel at this type of workload. For example,
the Ethereum PoW (ethash) is memory-bound, and GPU ethash implementations are
faster than CPU ones.

So I don't understand why the authors don't mention the possibility of
implementing SLIDE on GPU. Of course, I could have missed something (I spent
less than 10 minutes reading the paper)...

~~~
yvdriess
No, that is definitely not the case; not all memory-bound workloads are of the
same type.

Hash tables produce lots of data-dependent random accesses into DRAM, which
are definitely not better on GPUs: warp divergence, bank conflicts, partial
cache-line access inefficiencies, etc. Even CPUs struggle on this type of
pointer-chasing workload due to cache inefficiencies. Open-addressing schemes
such as Robin Hood hash maps are popular because they reduce the amount of
pointer chasing.

Your example is a false comparison; generating a hash is very different from
chasing the pointer generated by the hash.
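
For contrast, a minimal open-addressing lookup in C++ (linear probing; Robin
Hood is a refinement of the same layout; assumes the table is never full): the
probe sequence walks consecutive slots of one flat array, so after the first
miss the following probes usually land in the same cache line.

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>
    
    struct Slot { uint64_t key; int value; bool used; };
    
    // Open addressing with linear probing: no pointers to chase, and
    // consecutive probes touch adjacent slots, so they tend to stay within
    // the cache line already fetched by the first access.
    std::optional<int> find(const std::vector<Slot>& table, uint64_t key) {
        std::size_t i = key % table.size();
        while (table[i].used) {
            if (table[i].key == key) return table[i].value;
            i = (i + 1) % table.size();  // next slot, likely same line
        }
        return std::nullopt;
    }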

~~~
mrb
" _Your example is a false comparison, generating a hash is very different
from pointer chasing the address generated by the hash._ "

That's not true at all in the case of ethash, where the running time is
dominated by waiting for memory read ops to complete, not waiting for ALU ops
(hashing) to finish executing.

I have also written an Equihash miner where the workload is similar: running
time dominated by hashtable reads or writes, and can confirm GPUs beat CPUs.

I reiterate: GPUs _excel_ at data-dependent random reads, compared to CPUs.
Sure, these are very difficult workloads for both CPUs and GPUs, but GPUs
still trump CPUs. That's because the atom size (the minimum number of bytes a
GPU/CPU can read/write on the memory bus) is the same or better on GPU: 64
bytes on CPU (DDR4) and 32/64 bytes on GPU (HBM2), and GPUs have significantly
higher memory throughput, up to 1000 GB/s, while CPUs are still stuck around
200 GB/s per socket (AMD EPYC Rome, 8-channel DDR4-3200).

So in ethash or Equihash mining workloads, many data-dependent read ops across
a multi-GB data structure (much larger than local caches) will be mostly
bottlenecked by (1) the maximum number of outstanding read ops the CPU/GPU
memory controller can handle and (2) the overall memory throughput. In the
case of GPUs, (1) is not really a problem, so you end up being bottlenecked by
overall memory throughput. That's why GPUs win.

As of 3-4 years ago, I remember Intel CPUs having a maximum of 10 outstanding
memory operations, so (1) was the bottleneck. But things could have changed
with more recent CPUs. In any case, even if (1) is not a bottleneck on CPUs,
their lower memory throughput guarantees they will perform worse than GPUs on
such workloads.
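
Back-of-envelope version of that argument, using the numbers above (an upper
bound, not a measurement): if every random access pulls one full 64-byte line,
the peak read rate is just bandwidth divided by line size.

    #include <cstdio>
    
    int main() {
        const double line_bytes = 64.0;                // access granularity
        const double gpu_bw = 1000e9, cpu_bw = 200e9;  // bytes/s, from above
        // Upper bounds on random 64-byte reads per second at full bandwidth:
        std::printf("GPU: %.1f G reads/s\n",
                    gpu_bw / line_bytes / 1e9);  // ~15.6
        std::printf("CPU: %.1f G reads/s\n",
                    cpu_bw / line_bytes / 1e9);  // ~3.1
        return 0;
    }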

~~~
yvdriess
Correct, GPUs can indeed do a better job at hiding latency through massive
parallelism.

My expertise might be outdated here, but the problem used to be that actually
reaching that max bandwidth with divergent warps and uncoalesced reads was
just impossible.

Is this still the case with Volta? Did you avoid these issues in your Equihash
implementation?

~~~
mrb
Divergent warps are still a huge problem (but SLIDE doesn't have this problem
AFAIK).

Uncoalesced reads are not a problem severe enough to make GPUs underperform
CPUs. Or, said another way, uncoalesced reads come with a roughly equally bad
performance impact on both GPUs and CPUs.

------
LargoLasskhyfv
Why do you all guess? The paper has a link to their github with the source and
instructions:
[https://github.com/keroro824/HashingDeepLearning](https://github.com/keroro824/HashingDeepLearning)

I see no reason not to try it with some AMD Threadripper or EPYC instead.

~~~
threeseed
It looks like it benefits from AVX-512, which AMD does not support.

Might be worth trying on something like a 10940X/10980XE if you can get your
hands on one.

~~~
thesz
The AVX-512 benefits come from the gather/scatter instructions, I think.

What is interesting here is that in their current implementation these aren't
very beneficial [1], [2].

[1] [https://arxiv.org/pdf/1806.05713.pdf](https://arxiv.org/pdf/1806.05713.pdf)
[2] [https://www.sciencedirect.com/topics/computer-science/scatter-instruction](https://www.sciencedirect.com/topics/computer-science/scatter-instruction) (recommends these instructions be used outside of the main loop)

I vaguely remember that the first implementations of scatter/gather
instructions were no faster than sequential loads into separate registers.

And thus it may come in handy that AMD has a much bigger core count, because
each thread will have less memory to access.
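
Going back to the gather instructions, a minimal example (mine; whether it
beats a scalar loop depends heavily on the microarchitecture, per [1] and
[2]):

    #include <immintrin.h>
    
    // Gather 16 floats from `base` at 16 int32 indices in one AVX-512
    // instruction. Early implementations decomposed this into scalar loads
    // internally, which is why it was not always faster than a plain loop.
    __m512 gather16(const float* base, const int* idx) {
        __m512i vindex = _mm512_loadu_si512(idx);     // 16 x int32 indices
        return _mm512_i32gather_ps(vindex, base, 4);  // scale=sizeof(float)
    }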

------
kevingadd
For reference since it's kind of buried/obfuscated: Their point of comparison
is between an NVIDIA Tesla V100 and a 44-core CPU. The latter is probably
something like a Xeon E5-2699, which has a list price of $4115 USD. Hard to
find accurate pricing data for the V100, but it looks like it was in the
$7-10k USD range back in 2018. Still a cool cost/performance improvement but
not as massive as I was expecting before I looked up the test hardware.

~~~
raghavtoshniwal
Also worth noting that the Xeon processor takes somewhere around 350-400W and
the V100 is also in the range of 300W, so not huge on energy savings. Although
it's a really cool, potentially industry-shaking breakthrough regardless.

~~~
mkl
The V100 would need to run for 3.5 times as long, so would use way more
energy. E.g. training for 1 hour on CPU: 400W×1h = 400Wh of energy, vs
training for 3.5h on GPU: 300W×3.5h = 1050Wh of energy.

~~~
steerablesafe
Why is this downvoted? This is absolutely right.

------
jandrewrogers
> "hashing is a data-indexing method invented for internet search in the
> 1990s"

Eh? It was invented multiple decades prior to the 1990s. Some days you could
get the impression that computer science did not exist before the Web.

------
m0zg
A more correct headline would be: "An algorithm beats a poorly optimized GPU
implementation on a narrow problem that uses an extremely sparse dataset using
$8.5K worth of CPU".

This is for recommenders only, and it does not translate to anything else. I
don't know why everyone seems to misrepresent this as NVIDIA's undoing. Read
the freaking paper, people.

------
bmh
I'm curious to see if they can apply this to industry-standard tests like
ImageNet classification.

The workloads that they test on make it hard to quantify the broad
applicability of their work.

------
fareesh
> Slide: 3.5x Faster Deep Learning on CPU Then on GPU

Typo in the submission title - it should say "Than" on GPU

------
regularfry
I _think_ I'm right in saying that the type of locality-sensitive hash systems
they're talking about are not entirely dissimilar to Igor Aleksander's WISARD,
the RAM-based recognisers from the '80s. I suspect the latter is a special
case of the former. How far off-base am I?

------
andrewmatte
A reminder for us all: no matter how much faster your hardware is, you can
still write slower code.

------
ironfootnz
I feel like this is more a stunt for the new Intel Xeon. Creating a new
paradigm around a heterogeneous hardware dependency is misleading. Linear
analysis could be explored more; they could achieve the same with less, e.g.
holomorphic functional analysis.

But it's a good try. I'd say it catches my interest, as it makes a valid point
about optimizations:
[https://arxiv.org/pdf/1908.05858.pdf](https://arxiv.org/pdf/1908.05858.pdf)

------
signa11
actual paper:
[https://arxiv.org/abs/1903.03129](https://arxiv.org/abs/1903.03129)

------
bitL
Did they just accidentally kill NVIDIA's business model?

~~~
zozbot234
I don't think so. Matrix multiplication has plenty of uses besides neural
network training, and GPUs will still excel at those workloads.

~~~
bitL
Matrix multiplication with INT8, INT16 or BFLOAT16 doesn't have that many uses
outside Deep Learning.

------
poorman
I've been trying to get through this paper for the last two days. It's
somewhat sparse itself. Maybe I need to go read the code they wrote first...

------
tkyjonathan
This is obvious to me. Since Hadoop came out, (a lot of) people have been
giving up on even forming algorithms, just dumping data into machine learning
and hoping for the best. I recall someone high up at Google complaining about
it.

We need to get back to forming algorithms, as well as concepts and first
principles. We cannot and should not expect ML to brute-force pattern finding
while we just sit back and relax.

Here is another prediction for you: we will not solve ray-tracing in games and
movie CGI with more hardware. We will need some algorithm that gets us 80-90%
of the way there in a smart way.

~~~
taneq
This was my first thought. Well, to be more complete: smart algorithms beat
dumb algorithms even if the dumb algorithms use hardware acceleration (unless
the problem is trivial anyway). Smart algorithms plus hardware acceleration
beat smart algorithms on general-purpose hardware. Smart algorithms are just
better.

------
ourlordcaffeine
>We provide codes and scripts for reproducibility

Where? I want to get my hands on this code.

~~~
garybake
Papers with Code has a link to the repo:

[https://paperswithcode.com/paper/slide-in-defense-of-smart-algorithms-over](https://paperswithcode.com/paper/slide-in-defense-of-smart-algorithms-over)

------
mwexler
"But... Moore's Law, more hardware!" he plaintively cries out...

------
RawChicken
Sorry for the off-topic comment, but this then/than mistake I read every day
is just getting on my nerves.

" What to Know: Than and then are different words. Than is used in comparisons
as a conjunction, as in "she is younger than I am," and as a preposition, "he
is taller than me." Then indicates time. It is used as an adverb, "I lived in
Idaho then," noun, "we'll have to wait until then," and adjective, "the then
governor."" [1]

[1] [https://www.merriam-webster.com/words-at-play/when-to-use-then-and-than](https://www.merriam-webster.com/words-at-play/when-to-use-then-and-than)

~~~
flohofwoe
For a non-native speaker, it's hard to tell the two apart. Then and than sound
pretty much the same. And even if one knows the difference, it's easy to make
a typo.

In the sort of "international pidgin English" that's spoken everywhere outside
the UK, such subtle differences should just be ignored.

~~~
boardwaalk
I don’t think you’ll get much play for suggesting (to an audience of at least
some programmers) that we should allow for more ambiguity in language, heh.

The programmers I’ve met without an eye for detail are usually ones I do not
like working with.

~~~
flohofwoe
Hehe, true, but unlike programming languages, the languages humans use for
communication are "sloppy" and ambiguous by definition. Grammar rules were
invented after the fact, to create the illusion of order where none exists.

English allows much more "freedom" than many other languages (e.g. German);
maybe that's one reason it has been so successful in the end.

