> "... be noted that these datasets are very sparse, e.g., Delicious dataset has only 75 non-zeros on an average for input features, and hence the advantage of GPU over CPU is not ..."
In other words, they got a good speedup on their problem, but it might not apply to your problem.
It is extremely sparse.
If you look at language modeling, things there are even sparser - a typical neural language model has a 1-from-several-hundreds-of-thousands encoding for a full-vocabulary language (for Russian, for example, the vocabulary is in the range of 700K..1.2M words, and it is much worse for Finnish and German) and a 1-from-a-couple-of-tens-of-thousands encoding for a byte-pair-encoded language (most languages have an encoding that reduces the token count to about 16K distinct tokens).
The image classification task also has sparsity at the output and, if you implement it as an RNN, sparsity at the input (1-from-256 encoding of intensities).
Heck, you can engineer your features to be sparse if you want to.
I also think that this paper is an example of "if you do not compute, you do not have to pay for it", just like in the GNU grep case.
Given all that, I think it is a paper about a combination of very clever things which give excellent results in synergy.
Check figure 6 of https://arxiv.org/pdf/1906.00091.pdf
Embeddings are used to look up sparse features, so you have those pesky data-dependent lookups.
They may be a bottleneck, but the alternative is worse -- you can't fit complex models with large vocabularies into GPU memory using sparse one-hot encodings.
Technically, the sparse one-hot encoding is the most efficient in terms of memory footprint. You simply store the non-zero coordinates.
The problem in practice for GPUs is that sparse vector/matrix operations are too inefficient.
The whole point of something like this paper is to skip the entire 'densification' step and to directly deal with the sparse matrix input as a sparse matrix. The LSH used in this paper improves on directly using SpMSpV, as that is also inefficient on CPUs, although to a lesser extent than on GPUs.
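For what it's worth, here is a rough sketch (my own names like SparseVec, not the paper's code) of what skipping the densification step buys you: the dot product only touches the stored non-zero coordinates, so the work scales with the ~75 non-zeros per example rather than the full feature dimension.

#include <cstddef>
#include <vector>

struct SparseVec {                    // input stored as (index, value) pairs
    std::vector<int>   idx;           // positions of the non-zero features
    std::vector<float> val;           // their values
};

float sparse_dot(const SparseVec& x, const std::vector<float>& weight_row) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < x.idx.size(); ++k)   // ~75 iterations for Delicious,
        acc += x.val[k] * weight_row[x.idx[k]];      // not the full feature dimension
    return acc;
}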
You will also get much more meaningful embeddings from summing the embeddings of the parts of a word.
See libraries like magnitude for proper embedding lookup implementations.
Most real-world models don't use one-hot encodings of words -- they use embeddings instead, which are very dense vector representations of words. Outside of the fact that embeddings don't blow out GPU memory, they're also semantically encoded, so similar words cluster together.
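As a rough illustration (made-up names and dimensions, not any particular library's API), an embedding lookup is just a row gather from a dense table, and the sum-of-word-pieces trick mentioned above is a sum of a few rows - no vocabulary-sized one-hot vector is ever materialized:

#include <cstddef>
#include <vector>

using Table = std::vector<std::vector<float>>;   // vocab_size x embed_dim, dense rows

// Look up and sum the embeddings of the pieces making up one token.
std::vector<float> embed(const Table& table, const std::vector<int>& piece_ids) {
    std::vector<float> out(table[0].size(), 0.0f);
    for (int id : piece_ids)                      // e.g. byte-pair pieces of a word
        for (std::size_t d = 0; d < out.size(); ++d)
            out[d] += table[id][d];               // dense row gather + sum
    return out;
}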
For example, the embeddings produced by CBOW and skipgram word2vec models are strikingly different in the cosine-similarity sense - different classes of words come out as similar under CBOW vs. skipgram.
An example is Facebook's DLRM: https://arxiv.org/pdf/1906.00091.pdf
Put like this, I view it as clickbait.
CPUs, by contrast, have a thorough caching hierarchy that tends to focus on minimizing memory latency, so it doesn't take as long to do the second load compared to a GPU.
"SLIDE doesn’t need GPUs because it takes a fundamentally different approach to deep learning. The standard “back-propagation” training technique for deep neural networks requires matrix multiplication, an ideal workload for GPUs. With SLIDE, Shrivastava, Chen and Medini turned neural network training into a search problem that could instead be solved with hash tables"
In ELI5 or layman's terms: current GPU/TPU accelerators are specialized in doing very regular and predictable calculations very fast. In deep learning, a lot of those predictable calculations are not needed, like multiplying by zero. This approach leverages that and only does the minimal necessary calculations, but that makes the work very irregular and unpredictable. Regular CPUs are better suited for those kinds of irregular calculations, because most other general-purpose software is like that as well.
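Very loosely, and with everything here invented for illustration (SLIDE actually uses several hash tables and more machinery than this), the "search instead of multiply" idea looks something like: hash the input, pull the neuron ids that share its bucket, and compute activations only for that small active set, leaving the rest at zero.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative sketch only, not the actual SLIDE implementation.
std::vector<float> forward_active_set(
    const std::vector<float>& x,
    const std::vector<std::vector<float>>& W,                        // one weight row per neuron
    const std::unordered_map<std::uint64_t, std::vector<int>>& buckets,
    std::uint64_t (*lsh)(const std::vector<float>&))                 // locality-sensitive hash
{
    std::vector<float> act(W.size(), 0.0f);       // every non-selected neuron stays at 0
    auto it = buckets.find(lsh(x));               // the data-dependent table lookup
    if (it == buckets.end()) return act;
    for (int n : it->second) {                    // only the handful of "active" neurons
        float s = 0.0f;
        for (std::size_t d = 0; d < x.size(); ++d)
            s += W[n][d] * x[d];
        act[n] = s > 0.0f ? s : 0.0f;             // ReLU
    }
    return act;
}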
They should at least give a rationale for why a GPU does not speed up a locality-sensitive-hash-based algorithm. GPUs are used to compute fast hashes (they were used in Bitcoin mining once).
It's Intel-sponsored research, but come on.
The bottleneck is this chain of dependent loads:

ptr = h(x);          // trivial, ~0.5-2 cycles
bucket = *ptr;       // cache miss --> ~200-300 cycles
el_ptr = bucket[i];  // may hit the just-loaded line, otherwise another miss
el = *el_ptr;        // another dependent load --> ~200-300 cycles

Each load needs the result of the previous one, so the misses cannot be overlapped.
Obviously the memory throughput will not be as high as with matrix calculations, but the algorithm could still be optimized for the GPU. GPUs can do random access and have large and fast memories. What is the difference in memory line size? 64 bytes vs. 128 bytes?
Also, the SHA256(d?) employed by ethash is actually quite long - 80 cycles at the very least (one cycle per round). In mining you can interleave the hash computation for one header with the memory loads required for mining another header, and, from what I know, this is what CUDA on a GPU will do.
The sheer amount of compute power makes ethash mining faster on GPU.
On GPUs the atom size is 32/64 bytes. So GPUs are always better than or equal to CPUs when it comes to small reads/writes.
It's true that the compute cost of ethash is not negligible, but to give you one more data point: on Equihash there is even less compute spent on hashing, and GPUs still dominate CPUs.
- request to the L1D cache, misses
- request to the L2 cache, misses
- a packet goes out on the mesh network to access the L3, likely misses
- the L3 requests the load from the memory controller; the load is put in a queue
- DRAM access latency, ~100-150
- the above chain in reverse
This is the best-case scenario on a miss, because there could also be a DTLB miss on the address (which is why huge pages are crucial in the paper) or dirty cache lines in other cores that trigger the coherency mechanism.
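For reference, the usual way to get huge page backing on Linux looks roughly like this (this assumes transparent hugepage support is enabled on the box, and it is not taken from the paper's code):

#include <cstddef>
#include <sys/mman.h>

// Ask the kernel to back a big arena (e.g. the hash tables) with 2 MB pages,
// so random lookups burn far fewer DTLB entries.
void* alloc_hugepage_arena(std::size_t len) {
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, len, MADV_HUGEPAGE);   // advisory; falls back to 4 KB pages otherwise
    return p;
}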
... might have to...
The thing is, there's no benchmark for just neural-netting. Neural nets do well on a variety of image recognition tasks, but the nets that do well do so as particular trained instances on particular data. And, moreover, it's not just about doing well on a benchmark but about generalizing well. Essentially, neural nets aren't simply algorithms but paradigms (for good and ill).
Instead of the babble about CPU vs GPU, they should be talking about the strengths and weaknesses of their approach to image recognition and related tasks. Their locality-sensitive-hash approach is certainly interesting, and some parts of their approach might hypothetically be useful for dealing with the weaknesses of the neural net approach (fragility, black-box unexplainability, etc.).
So I don't understand why the authors don't mention the possibility of implementing SLIDE on GPU. Of course, I could have missed something (I spent less than 10min reading the paper)...
Hash tables produce lots of data-dependent random accesses into DRAM, which are definitely not better on GPUs: warp divergence, bank conflicts, partial cache-line access inefficiencies, etc.
Even CPUs struggle on this type of pointer chasing workloads due to cache inefficiencies. Open addressing schemes such as robin hood hashmaps are popular because they reduce the amount of pointer chasing.
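A minimal probe loop looks like this (plain linear probing; robin hood adds displacement bookkeeping on top of the same access pattern, and the names here are made up): all probes walk one contiguous array, so a collision usually costs another read in the same or the next cache line instead of another dependent pointer dereference into a fresh DRAM row.

#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot { std::uint64_t key = 0; std::uint32_t value = 0; bool used = false; };

// Lookup by linear probing; assumes the table size is a power of two and the table is never full.
bool lookup(const std::vector<Slot>& table, std::uint64_t hashed_key, std::uint32_t& out) {
    const std::size_t mask = table.size() - 1;
    for (std::size_t i = hashed_key & mask; ; i = (i + 1) & mask) {   // consecutive slots
        if (!table[i].used) return false;                             // empty slot: key is absent
        if (table[i].key == hashed_key) { out = table[i].value; return true; }
    }
}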
Your example is a false comparison: generating a hash is very different from pointer-chasing the address generated by the hash.
That's not true at all in the case of ethash, where the running time is dominated by waiting for memory read ops to complete, not waiting for ALU ops (hashing) to finish executing.
I have also written an Equihash miner where the workload is similar: running time dominated by hashtable reads or writes, and can confirm GPUs beat CPUs.
I reiterate: GPUs excel at data-dependent random reads, compared to CPUs. Sure, these are very difficult workloads for both CPUs and GPUs, but GPUs still trump CPUs. That's because the atom size (minimum number of bytes that a GPU/CPU can read/write on the memory bus) is the same or better on GPU: 64 bytes on CPU (DDR4), and 32/64 bytes on GPU (HBM2), and GPUs have a significantly higher memory throughput up to 1000 GB/s while CPUs are still stuck around 200 GB/s per socket (AMD EPYC Rome 8-channel DDR4-3200).
So in ethash or Equihash mining workloads, many data-dependent read ops across a multi-GB data structure (much larger than local caches) will be mostly bottlenecked by (1) the maximum number of outstanding read ops the CPU/GPU memory controller can handle and (2) the overall memory throughput. In the case of GPUs, (1) is not really a problem, so you end up being bottlenecked by overall memory throughput. That's why GPUs win.
As of 3-4 years ago I remember Intel CPUs having a maximum of 10 outstanding memory operations, so (1) was the bottleneck. But things could have changed with more recent CPUs. In any case, even if (1) is not a bottleneck on CPUs, their lower memory throughput guarantees they will perform worse than GPUs on such workloads.
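Back-of-the-envelope version of why (1) matters, using Little's law (achievable bandwidth is roughly outstanding misses times line size divided by latency); the numbers below are illustrative rather than measured, and GPU memory latency is actually higher than CPU latency, but the in-flight count more than compensates:

#include <cstdio>

int main() {
    const double line_bytes      = 64.0;    // one cache-line / memory transaction
    const double latency_ns      = 100.0;   // rough DRAM round trip
    const double cpu_outstanding = 10.0;    // per-core miss buffers (the old figure above)
    const double gpu_outstanding = 10000.0; // order of magnitude: warps kept in flight
    // bytes per nanosecond happens to equal GB/s
    std::printf("CPU core: ~%.1f GB/s\n", cpu_outstanding * line_bytes / latency_ns);   // ~6.4
    std::printf("GPU:      ~%.0f GB/s before hitting the ~1000 GB/s HBM2 ceiling\n",
                gpu_outstanding * line_bytes / latency_ns);
}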
My expertise might be outdated here, but the problem used to be that actually getting to that max bandwidth through divergent warps and uncoalesced reads was just impossible.
Is this still the case with Volta? Did you avoid these issues in your Equihash implementation?
Uncoalesced reads are not a problem severe enough to make GPUs underperform CPUs. Or, said another way, uncoalesced reads come with a roughly equally bad performance impact on both GPUs and CPUs.
(I'm sure you know this, just wanted to clarify.)
So ~12GB total for those datasets.
To quote Wikipedia: "Ethash uses an initial 1 GB dataset known as the Ethash DAG and a 16 MB cache for light clients to hold. These are regenerated every 30,000 blocks, known as an epoch."
You generate it once, access many times.
If each memory access is 32 bytes, then one needs no more than 32*2^20 (~33M) computations to get completely sequential access. To very probably not miss a prefetched DRAM block (the main throughput bottleneck), which is typically 4K..8K in size, one needs about 128 times fewer computation threads. And this number (~250 thousand threads) is well within reach for current GPU models.
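Spelling that arithmetic out (this is my reading of the estimate above, assuming a ~1 GB table like the ethash DAG and 4 KB DRAM rows):

#include <cstdio>

int main() {
    const long long table_bytes  = 1LL << 30;  // ~1 GB working set
    const long long access_bytes = 32;         // one GPU memory transaction
    const long long dram_row     = 4096;       // prefetched DRAM row
    const long long accesses = table_bytes / access_bytes;  // 2^25, i.e. 32*2^20 ~= 33.5M
    const long long per_row  = dram_row / access_bytes;     // 128 accesses per open row
    std::printf("concurrent threads needed: ~%lld\n", accesses / per_row);  // 262,144 ~= 250K
}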
State of the NN weights will change after each batch, on the other hand.
I see no reason not to try it with some AMD Threadripper or EPYC instead.
Might be worth trying on something like a 10940X/10980XE if you can get your hands on them.
What is interesting here is that in their current implementation they aren't very beneficial.
https://www.sciencedirect.com/topics/computer-science/scatte... (recommends these instructions be used outside of the main loop)
I vaguely remember that the first implementations of scatter/gather instructions were not faster than doing the accesses sequentially via individual registers.
And thus it may come in handy that AMD has a much bigger core count, because each thread will have less memory to access.
So if it's 3.5 times faster, then you require only roughly 30% of the time (or hardware) it took before; both savings combined seem pretty significant to me.
Edit: quick napkin math, let's take your $7k figure for the GPU and $4,115 for the CPU. We need roughly one third of that CPU, so 4115 * 0.3 = 1234.5, and 1234.5 / 7000 ≈ 0.176.
About 18% of what it cost before...
Not quite. Further in the paper: "Similarly, for larger Amazon dataset, the red and black line never intersect, and the red and blue line intersects on 8 cores. This means that SLIDE beats Tensorflow-GPU with as few as 8 CPU cores and Tensorflow-CPU with as few as 2 CPU cores."
Yes, they have a beast machine, but it can outperform the GPU with only 8 cores in some cases.
Nvidia lists 300W as the TDP for the V100.
But in all cases, I would not expect serious research to outright beat the current NNs by 10x on all dimensions. That will take some time (and may not fully happen), and this paper is certainly a great advance.
Availability is pretty much non-existent though. Also you can OC it pretty close to 5GHz with decent water cooling.
Eh? It was invented multiple decades prior to the 1990s. Some days you could get the impression that computer science did not exist before the Web.
This is for recommenders only, and it does not translate to anything else. I don't know why everyone seems to misrepresent this as NVIDIA's undoing. Read the freaking paper, people.
The workloads that they test on make it hard to quantify the broad applicability of their work.
Typo in the submission title - it should say "Than" on GPU
But it’s a good try. I’d say that this catches my interest, as it is a valid point for optimization: https://arxiv.org/pdf/1908.05858.pdf
In practice, TPUs and SoCs with inference extensions are a much bigger threat for NVIDIA in the cloud and automotive business.
We need to get back to forming algorithms as well as concepts and first principles. We cannot and should not expect ML to brute-force the finding of patterns while we just sit back and relax.
Here is another prediction for you: we will not solve ray-tracing in games and movie CGI with more hardware. We will need some algorithm that gets us 80-90% of the way there in a smart way.
Where? I want to get my hands on this code
What to Know:
> Than and then are different words. Than is used in comparisons as a conjunction, as in "she is younger than I am," and as a preposition, "he is taller than me." Then indicates time. It is used as an adverb, "I lived in Idaho then," noun, "we'll have to wait until then," and adjective, "the then governor."
Edit: Clearly this is a contentious comment, and even those of us that see things this way mostly seem to agree that we understand the intended meaning (but things get fuzzier with smaller numbers expressed as percentages, e.g. "120% faster"). Surely, though, it makes more sense to use the completely precise phrasing "3.5x as fast", especially for the main statement of the main result in an academic paper.
I don't see how that's the case. What is "50% bigger"? I understand it as 150% of the size. Similarly, I would understand "90% bigger" to mean "190% of the size" and "100% bigger" to mean "200% of the size", and I'm sure that is how they are used in practice.
So then surely "200% bigger" means "300% of the size"? And "two times bigger" - which is mathematically identical to "200% bigger" - would be three times the size. I acknowledge that the phrase is often used like you say, but I don't think that is its literal meaning, and if you used that in a contract I think it would legally be interpreted in the way I've said (I'm thinking of consumer law in a situation where something is "n times bigger for the same price").
All this applies analogously to speed and "x times faster". I gave the examples above with size because I think it's a bit more of a common thing to talk about this way, and speed is a bit more subtle because what we're measuring here is the time, which is (something) / speed, and here the numerator isn't clear (the number of training runs, perhaps).
Interestingly, in Russian "100% bigger" and "two times bigger" will have different prepositions, so it's clearer that in the first case you do a sum (x + 100%·x = 2x), and in the second case you do a multiplication (2·x = 2x).
I'm quite sure it's the same in English (sum with % and multiplication with times), but I'm not a native English speaker.
So I agree with the article, "3.5 times faster (1 hour vs. 3.5 hours)" is perfectly correct, and it is OK to abbreviate "3.5 times" as "3.5x".
But really, relative to something else, I guess it would be 0.33x as fast, because it would take two times more time, or 3 times as long.
The meaning of "two times faster" as "twice as fast" is certainly the way such a statement would generally be interpreted everywhere I've worked.
It is of course possible that the meaning suggested by quietbritishjim is archaic British, but I certainly don't believe it's current: I've worked in various places in Cambridge and London for the past 18 years and, as I say, have never encountered it.
I absolutely agree with that, and in practice if I ever see that turn of phrase with anything more than 100% then I assume that they are using it the way that you're thinking of. But I maintain this is not the literal meaning. At the same time, I'm not saying people should be subtracting 100% to make the number mathematically correct, that would definitely be bizarre. Instead, I'm saying they shouldn't be using such a weird turn of phrase in the first place, so the headline should simply be "Deep learning on CPU 3.5x as fast as on GPU"
> I've worked in various places in Cambridge and London for the past 18 years and, as I say, have never encountered it.
You have really never encountered an item in a supermarket saying "now 20% bigger!"? Thinking about it now, they're usually charging the same amount as the old size (otherwise it's not much to brag about really) in which case they use the vastly clearer phrase "20% extra free", but I'm sure I've seen the former phrase.
If even that.
I think "3.5x faster" to mean "3.5x as fast" is fairly common, pretty clear, and very understandable to anyone but daft grammar prescriptivists.
> A personal note: I know the disparagement of Times-er from long ago, from my grade school years, I think; I was taught that two times more than X really means 'three times as many as X'. Since authority figures insisted on this interpretation, I avoided the construction entirely (as, as far as I know, I still do). Yet I've never stopped asking, "Why don't you understand the clear meaning of what people are saying?"
>> That last reaction incorporates one criticism of Times-er, namely that it is "illogical" or "irrational": X times more than Y MUST MEAN 'Y plus X-times-Y (that is, 'X+1 times Y'), not 'X times as many/great as Y' (that is, 'X times Y'). (In the most common variant of this reaction, X times more than Y is disparaged because it is said to be ambiguous, with both the 'X times Y' and 'X+1 times Y' interpretations.)
>> The appeal here is to the idea that ordinary-language expressions are simply realizations of logical (or arithmetical) formulas. This is just backwards. The formulas are there to represent the meanings of expressions; they are not the prior reality, merely cloaked in (those devilishly vague) words of actual languages.
For example you also wouldn’t say -0.5 times faster, it just doesn’t make sense.
Other pet peeves:
"irregardless" isn't a word. It's regardless.
If you don't care about something, you "couldn't care less about it". It's simple logic and yet people mess it up all the time.
Please don't quiz me on how to read bear, pear, tear, fear, spear, clear, and dear.
I don't get angry at non-standard usage but I think it's important not to ignore the purpose of consistent style.
Indeed, because they come from the same root - the difference only came from English peasants and their wacky spellings.
Also off-topic, but I heard the word Senglish, for Singaporean English, on a podcast yesterday. That made me realize how English has the potential to become a universal language, with each country having its own version.
French people already use the term Frenglish when they mix English words with French. We could have Spanglish, Japenglish, Germanish, etc. They don't have to be called those, though.
It would be totally awesome to be able to communicate with almost everyone in the world. Just like the internet!
In the sort of "international pidgin English" that's spoken anywhere outside the UK, such subtle differences should just be ignored.
Not to mention that this specific kind of mistake (similar sounds) comes at least as often from native speakers as from non-native ones, in my experience / native language.
In French, a lot of people mistake Ça for Sa, for example, native and non-native alike.
As a matter of fact, it shouldn't: the sentence with "then" in it has a different meaning altogether than the one with "than" in it.
Using "then" suggests that something is done on the CPU and then ("then") on the GPU.
The programmers I’ve met without an eye for detail are usually ones I do not like working with.
English allows much more "freedom" than many other languages (e.g. German), maybe that's one reason why it has been so successful in the end.