I had to get down to the graph to realize they're talking about SVM, not deep learning.
This could be pretty cool. Training an SVM has usually been "load ALL the data and go", and sequential implementations are almost non-existent. Even if this ran at 1x or 0.5x speed but didn't require the entire dataset at once, it would be a big win.
There's still a ton of use for classical learning algorithms. I'd be a very happy camper if we could speed SVMs up by an order of magnitude.
Indeed, for relatively "simple" models, SVM can get very, very close to deep learning accuracy for classification, with only a fraction of the computing time needed.
A product from an IBM consultant is about as related to a product from IBM Watson as a product from Microsoft is to a product from Apple.
Sure they might have some divisions that do better, but I have yet to see them.
Ultimately, it seems like IBM has managed to make a generalized gather/scatter operation over large datasets for this particular task. Yes, this is an "old problem", but at the same time, it's the kind of engineering advancement that definitely deserves discussion. Any engineer who cares about performance will want to know about memory optimization techniques.
As CPUs (and GPUs! And Tensors, and FPGAs, and whatever other accelerators come out) get faster and faster, the memory-layout problem becomes more and more important. CPUs / GPUs / etc. etc. are all getting way faster than RAM, and RAM simply isn't keeping up anymore.
A methodology to "properly" access memory sequentially has broad applicability at EVERY level of the CPU or GPU cache.
From Main Memory to L3, L3 to L2, L2 to L1. The only place this "serialization" method won't apply is in register space.
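To make that concrete, here's the textbook illustration (nothing from the talk, just the usual demo): summing the same matrix row-by-row versus column-by-column does identical arithmetic, but the strided version takes a cache miss on nearly every access once the matrix outgrows each cache level.

    // Toy illustration: identical arithmetic, very different memory behavior.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Sequential traversal: consecutive addresses stream through cache lines,
    // and the hardware prefetcher keeps every cache level fed.
    static float sum_sequential(const std::vector<float>& m, int n) {
        float s = 0.0f;
        for (int row = 0; row < n; ++row)
            for (int col = 0; col < n; ++col)
                s += m[(size_t)row * n + col];
        return s;
    }

    // Strided traversal: each access jumps a full row ahead, so almost every
    // load touches a new cache line (and, for large n, a new TLB page).
    static float sum_strided(const std::vector<float>& m, int n) {
        float s = 0.0f;
        for (int col = 0; col < n; ++col)
            for (int row = 0; row < n; ++row)
                s += m[(size_t)row * n + col];
        return s;
    }

    int main() {
        const int n = 8192;  // ~256 MB of floats, far larger than any cache
        std::vector<float> m((size_t)n * n, 1.0f);
        printf("%f %f\n", sum_sequential(m, n), sum_strided(m, n));
        return 0;
    }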
The "machine learning" buzzword is getting annoying IMO, but there's likely a very useful thing to talk about here. I for one am excited to see the full talk.
It does seem specific to machine learning / Tensors. But that's still cool. I'll have to sit down and grok the paper more carefully to fully understand what they're doing.
I wonder if there is a similar lower bound that can be constructed for non-convex problems which retains enough properties for this method to be useful?
If only! In fact it's a struggle to utilize a GPU to its full potential because the communication bottleneck makes it infeasible. Compute is fast but data can't get there fast enough.
The authors of this paper were saying the same thing in the promo video; in fact, they were working on making GPUs more efficient. Why would they do that if GPUs were already being used to their "full potential"?
Not really. A modern Coffee Lake i7 has several distinct advantages over GPUs. (AMD Ryzen also has similar advantages, but I'm gonna focus on Coffee Lake)
1. AVX2 (256-bit SIMD): for 32-bit ints/floats, that's 8 operations per cycle. AVX512 exists (16 operations per cycle) but it's only on server architectures. Also, AVX512 has... issues... with the superscalar point #2 below. So I'm assuming AVX2 / 256-bit SIMD.
2. Superscalar execution: every Skylake i7 (and Coffee Lake by extension) has THREE AVX ports (Port 0, Port 1, and Port 5). We're now up to 24 operations per cycle in fully optimized code... although Skylake AVX2 can only do 16 fused multiply-adds at a time per core, since FMA runs on only two of those ports.
3. Intel machines run at 4GHz or so, maybe 3GHz for some of the really high core-count models. GPUs only run at 1.6GHz or so. This effectively gives a 2x to 2.5x multiplier.
So realistically, an Intel Coffee Lake core at full speed is roughly equivalent to 32 GPU "cores" (8x from AVX2 SIMD, 2x-3x from superscalar, and 2x from clock speed). If we compare like with like, a $1000 Nvidia Titan X (Pascal) has 3584 cores, while a $1000 Intel i9-7900X Skylake has 10 CPU cores (each of which can perform roughly as well as 32 Nvidia cores in fused multiply-add FLOPs).
The i9-7900X Skylake is maybe 10x slower than an Nvidia Titan X when both are pushed to their limits. At least, on paper.
And remember: CPUs can "act" like a GPU by using SIMD instructions such as AVX2. GPUs cannot act like a CPU with regards to latency-bound tasks. So the CPU / GPU split is way closer than what most people would expect.
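For concreteness, this is roughly what "acting like a GPU" looks like on the CPU side — a minimal AVX2/FMA sketch of my own (nothing to do with the paper), doing 8 single-precision multiply-adds per instruction, which the core can then dual-issue across its two FMA ports:

    // Minimal AVX2/FMA sketch: y[i] += a * x[i], 8 floats per instruction.
    // Assumes a Haswell-or-later core; compile with -mavx2 -mfma.
    #include <immintrin.h>
    #include <cstdio>

    void saxpy_avx2(float a, const float* x, float* y, int n) {
        __m256 va = _mm256_set1_ps(a);              // broadcast a into all 8 lanes
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(x + i);     // load 8 floats
            __m256 vy = _mm256_loadu_ps(y + i);
            vy = _mm256_fmadd_ps(va, vx, vy);       // 8 fused multiply-adds in one instruction
            _mm256_storeu_ps(y + i, vy);
        }
        for (; i < n; ++i)                          // scalar tail
            y[i] += a * x[i];
    }

    int main() {
        float x[16], y[16];
        for (int i = 0; i < 16; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
        saxpy_avx2(3.0f, x, y, 16);
        printf("%f\n", y[0]);                       // prints 5.000000
        return 0;
    }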
A major advantage GPUs have is their "shared" memory (in CUDA) or "LDS" memory (in OpenCL). CPUs have a rough equivalent in L1 cache, but GPUs also have an L1 cache on top of that. Based on what I've seen, GPU "cores" can all access shared/LDS memory every clock (if optimized perfectly: perfectly coalesced accesses across memory banks and so on. Not easy to do, but it's possible).
But Intel Cores can only do ~2 accesses per clock to their L1 cache.
GPUs can execute atomic operations on the Shared / LDS memory extremely efficiently. So coordination and synchronization of "threads", as well as memory-movements to-and-from this shared region is significantly faster than anything the CPU can hope to accomplish.
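As a generic sketch of what that buys you (standard CUDA, not the paper's code): a block stages its slice of the input into shared memory with one coalesced global load, then does all further traffic and synchronization inside the block, finishing with a single global atomic.

    // Generic CUDA block-level sum: stage data in shared/LDS memory, reduce
    // there, and touch global memory only once on the way in and once on the way out.
    // Launch as: block_sum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    __global__ void block_sum(const float* in, float* out, int n) {
        __shared__ float tile[256];                       // assumes blockDim.x == 256
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;   // coalesced global load
        __syncthreads();

        // Tree reduction entirely inside shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            atomicAdd(out, tile[0]);                      // one global atomic per block
    }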
A second major advantage is that GPUs often use GDDR5 or GDDR5X (or even HBM), which is superior main memory. The Titan X has 480 GB/s (that's big B, bytes) of main-memory bandwidth.
A quad-channel i9-7900X Skylake will only get ~82 GB/s when equipped with 4x DDR4-3200 RAM.
GPUs have a memory advantage that CPUs cannot hope to beat, and IMO that's where their major practicality lies. The GPU architecture has a much harder memory model to program for, but it's way more efficient to execute.
The other point to add is that the SIMD execution model on GPUs is what gives them efficient batched (coalesced) reads from GPU memory for each operation.
I can't say I'm an expert yet. But the more I read about highly optimized code on any platform, the more I realize that 90% of the problem is dealing with memory.
Virtually every optimization guide or highly-optimized code tutorial spends an enormous amount of time discussing memory problems. It seems like memory bandwidth is the singular thing that HPC coders think about the most.
If you don't have enough computations-per-byte to perform on the GPU, you will find your total job time starts to be dominated by the time it takes to stage data in and out of the GPU, without being able to keep the GPU cores busy. Even if the CPU is 5-10x slower according to issue rates and RAM bandwidth, it can keep calculating steadily with a higher duty cycle since system RAM can be much larger.
However, the CPU also benefits from locality, so you should still prefer to structure your work into block-decomposed work units if possible. A decomposition which allows you to work through a large problem as a series of sub-problems sized for a modest GPU RAM area will also let the sub-problem rise higher in the CPU cache hierarchy to get more effective throughput. However, if the decomposition adds too much sequential overhead for marshalling or final reduction of results, it may not help versus a monolithic algorithm with reasonably good vectorization/streaming access to the full data.
That way you get pretty good memory bandwidth, can directly access much more RAM (1 TB easily), and you can run a wide variety of codes (not just GPU codes).
Sure, the Titan X is great if your code A) doesn't communicate, B) fits entirely in GPU memory, and C) runs on CUDA. Of course the real world often intrudes with PCIe latency and memory limitations.
Not saying GPUs don't have their place, but it's easy to overstate their usefulness.
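On the staging point: the usual mitigation when the working set doesn't fit in GPU RAM is to pipeline chunk transfers against compute with pinned buffers and a couple of streams, so the PCIe copy of the next chunk overlaps the kernel running on the current one. A rough sketch, with a made-up process_chunk kernel standing in for the real work:

    // Rough sketch of the standard mitigation (NOT the paper's scheme): pipeline
    // chunk transfers against compute with two streams and pinned buffers.
    #include <cuda_runtime.h>
    #include <cstring>
    #include <vector>

    __global__ void process_chunk(const float* chunk, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) (void)chunk[i];                  // stand-in for real work on chunk[i]
    }

    void stream_through_gpu(const float* host_data, size_t total, size_t chunk) {
        float *h_pinned, *d_buf[2];
        cudaMallocHost((void**)&h_pinned, 2 * chunk * sizeof(float));  // pinned staging area
        cudaMalloc((void**)&d_buf[0], chunk * sizeof(float));
        cudaMalloc((void**)&d_buf[1], chunk * sizeof(float));
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (size_t off = 0, i = 0; off < total; off += chunk, ++i) {
            int b = (int)(i & 1);                   // ping-pong between the two streams
            size_t n = (total - off < chunk) ? (total - off) : chunk;
            cudaStreamSynchronize(s[b]);            // wait until this buffer's previous chunk is done
            memcpy(h_pinned + b * chunk, host_data + off, n * sizeof(float));
            cudaMemcpyAsync(d_buf[b], h_pinned + b * chunk, n * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process_chunk<<<(int)((n + 255) / 256), 256, 0, s[b]>>>(d_buf[b], (int)n);
        }
        cudaDeviceSynchronize();
        cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
        cudaFree(d_buf[0]); cudaFree(d_buf[1]); cudaFreeHost(h_pinned);
    }

    int main() {
        std::vector<float> data(1 << 22, 1.0f);     // ~16 MB of toy data
        stream_through_gpu(data.data(), data.size(), 1 << 18);
        return 0;
    }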
If you know the name of a Xeon Skylake server part that costs roughly $1000 (and is therefore comparable to a Titan X in MSRP), and its memory capacity, you are welcome to rerun the analysis yourself.
I can't do that because I don't know the capabilities of the Xeon Skylake servers from memory, nor their prices. And I'm certainly not going to spend 30 minutes googling this information for other people's sake.
What I will say is that the i9-7900X is a Skylake-server part with AVX512 support and quad-channel memory. That's way stronger than a typical desktop. And I think assuming quad-channel 4x DDR4-3200 is pretty fair, all things considered.
If the GPU were truly 1000x more efficient than the CPU, then the CPU vendor could just take 1/1000th of a GPU and squeeze it onto their own chip to double their performance.
(In a sense the trend since the late 90's has been to do exactly this via vector extensions.)
The paper in discussion here reports 10x speedup for GPU vs CPU.
The article was touched by the PR dept, but it still has actual information.
They did the same thing that has been done for thousands of years. Back then the hot area of research was how to stage advance food and resource caches along a route for long journeys. They came up with algorithms to optimize cache hits.
In this case, the problem is that GPUs can be fast for ML, but they usually only have ~16GB of RAM while the dataset can be terabytes.
Simple chunk processing would seem to solve the problem, but it turns out the overhead of CPU/GPU transfers badly degrades performance.
Their claim here is that they can determine on the fly how important different samples are, and make sure samples that yield better results are in the cache more often than those with less importance.
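I can't speak for their actual algorithm, but the general flavor is something like this toy sketch: keep a score per chunk of data, sample the next chunk to stage into GPU memory proportionally to that score, and update the score from how much that chunk actually moved the objective.

    // Toy sketch of importance-weighted chunk selection (NOT the authors'
    // actual scheme): chunks with higher scores get staged into GPU memory
    // more often than chunks that barely change the model.
    #include <cstdio>
    #include <random>
    #include <vector>

    int pick_chunk(const std::vector<double>& importance, std::mt19937& rng) {
        std::discrete_distribution<int> dist(importance.begin(), importance.end());
        return dist(rng);                           // P(chunk i) proportional to importance[i]
    }

    int main() {
        std::mt19937 rng(42);
        std::vector<double> importance = {0.1, 2.0, 0.5, 5.0};  // hypothetical per-chunk scores
        std::vector<int> hits(importance.size(), 0);

        for (int step = 0; step < 1000; ++step) {
            int c = pick_chunk(importance, rng);
            ++hits[c];
            // ...train on chunk c, then update importance[c] based on how much
            // the objective improved (the interesting part, omitted here)...
        }
        for (size_t i = 0; i < hits.size(); ++i)
            printf("chunk %zu staged %d times\n", i, hits[i]);
        return 0;
    }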
I mean, we all love the magic, but I think we're getting spoiled as of late with all the magic AI/Deep Learning stuff coming out.
My goal in a tldr is only to minimize the number of seconds it takes to digest some essential concept.
I wish for every article here someone would write up a 1 sentence tldr and a one paragraph tldr+, to help us track more happenings in our head at once and to help choose the ones we decide to spend our deep reading time on.
But of course your point is valid, shoulders of giants and what have you...