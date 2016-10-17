Hacker News
Pushing a Trillion Row Database with GPU Acceleration (nextplatform.com)
Those people who said GPUs would never be any use for general-purpose computing tasks have been looking a bit silly lately. Perhaps they had forgotten about the advantages of math coprocessors in earlier hw architectures.


Unfortunately, despite GPUs being ubiquitous, they get very little use in the desktop/mobile space, aside from games. Apps use them as little more than blitters, if they even do that much. There's a ton of computing power we're leaving on the table.

Of course, improving this is a big opportunity for innovation, so perhaps it's not so unfortunate after all :)


I wonder if at some point (hopefully not that far from now) there's some breakthrough or development that would allow FHE in GPU.

I'm guessing that could be a front where the unused GPU power in desktop/mobile can be put to use, besides games.

I could imagine some sort of algorithm/driver/framework plus a protocol that would allow remote services to make queries on your personal files (without privacy issues), while being accelerated by the client's GPU.

Obviously I'm talking out of my behind here but I hope maybe someday someone will figure it out.


There has already been some academic work on accelerating FHE using GPUs, and it does indeed make a big difference. However, it's still orders of magnitude slower than native computation.


Excuse my ignorance, but what do you mean by FHE?


Likely Fully Homomorphic Encryption, e.g.,

"Accelerating fully homomorphic encryption using GPU" http://ieeexplore.ieee.org/document/6408660/


Aha!

Very nice. This looks interesting. Even though I probably won't understand it, at least it means I wasn't that far off.

Hopefully we will see more of this sooner rather than later. FHE is one of those things that I think would be a game changer if it's cheap enough to use broadly.


Yes, sorry.

Fully Homomorphic Encryption

https://en.wikipedia.org/wiki/Homomorphic_encryption
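To make "homomorphic" concrete, here is a minimal sketch of computing on ciphertexts without decrypting them. It uses unpadded textbook RSA, which is only multiplicatively homomorphic, so this is a stand-in for the idea and not FHE itself. The tiny primes and exponent are assumptions picked for illustration and are completely insecure.

```python
# Toy textbook RSA. NOT secure and NOT fully homomorphic:
# it only supports multiplication on ciphertexts.
p, q = 61, 53                # assumed tiny primes, illustration only
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)
e = 17                       # public exponent, coprime with phi
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)


def encrypt(m):
    return pow(m, e, n)


def decrypt(c):
    return pow(c, d, n)


a, b = 7, 12
# Multiply the ciphertexts only; the party doing this never sees a or b.
c_product = (encrypt(a) * encrypt(b)) % n
# Decrypting the result gives the product of the plaintexts (mod n).
assert decrypt(c_product) == (a * b) % n
```

A fully homomorphic scheme supports both addition and multiplication on ciphertexts, which is what makes arbitrary computation possible and also what makes it so expensive.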


Thanks!


There are two killers: compatibility and launch overhead.

CUDA is the most commonly-used compute language but it's NVIDIA-only. OpenCL is "universal" (sort of) but - being a general-purpose framework for heterogeneous compute rather than a task-specific GPGPU language - it's much more complex and lacks the libraries/etc that are available in CUDA. NVIDIA's OpenCL performance is all over the board, sometimes good sometimes fairly terrible (half of CUDA performance or less). Other users have nothing at all.

So inherently - any GPU acceleration targets only a subset of your users.

The other killer is launch overhead. Marshalling data into an appropriate structure, copying it over to the GPU, and copying it back takes a significant amount of time in many cases, independent of computation. This is really only viable when there is a significant amount of computation to be done, which inherently limits the number of applicable tasks.
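A rough way to reason about that break-even point is a back-of-envelope cost model. The sketch below is hypothetical: the function name, the bus bandwidth and the fixed launch overhead are assumed round numbers, not measurements of any particular hardware.

```python
def offload_pays_off(bytes_in, bytes_out, cpu_seconds, gpu_seconds,
                     bus_bytes_per_s=12e9, launch_overhead_s=20e-6):
    """Return (worth_it, gpu_total_seconds) for one offloaded call.

    bus_bytes_per_s and launch_overhead_s are assumed round numbers.
    """
    # Copy-in plus copy-back time, independent of how much compute is done.
    transfer = (bytes_in + bytes_out) / bus_bytes_per_s
    gpu_total = launch_overhead_s + transfer + gpu_seconds
    return gpu_total < cpu_seconds, gpu_total


# Tiny job: the fixed overhead alone exceeds the CPU time.
small = offload_pays_off(1024, 1024, cpu_seconds=5e-6, gpu_seconds=0.5e-6)
# Big job: transfer cost is amortized over a lot of computation.
large = offload_pays_off(120_000_000, 0, cpu_seconds=1.0, gpu_seconds=0.05)
assert not small[0]
assert large[0]
```

Under these assumptions the offload only wins when the CPU time saved is larger than the fixed overhead plus the copy time, which is the "significant amount of computation" condition.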

In fact ideally you would have data living right on the GPU all the time and you just make queries against it. But then you run into memory limitations - most GPUs have ~4 GB of VRAM and that's only enough VRAM for a couple applications at most (remember, we need to do a lot of work which means a fairly large quantity of data). And also since you want memory allocations to be contiguous you tend to allocate aggressively, which really means only 1-2 applications can use it.

So practically speaking you now have a subset of your subset of users, the people who don't have another application running on their GPU at the same time.

You can see how all this pushes it towards very niche tasks, being performed one at a time.

The other approach is having an iGPU which can work in the same memory space as the CPU, like AMD does with HSA. That would lower the overhead quite a bit but right now that's a very limited subset of hardware.

Practically speaking, right now the most viable general approach for vector processing is SSE/AVX, and you can see how poor the uptake of AVX has been.

There are also major security concerns: GPUs aren't very conscious about things like memory segmentation or zeroing before re-allocation. As long as what you are doing is not blatantly illegal (writing to memory addresses that do not exist, etc) then it pretty much is just a single flat memory space that you can do anything with, and on non-ECC devices memory is only zeroed on a cold startup.

Think "microcontroller" here, not memory like you'd get in desktop-land.

https://arxiv.org/pdf/1305.7383.pdf

https://petsymposium.org/2017/papers/issue2/paper10-2017-2-s...

https://cryptome.org/2013/09/gpu-keylogger.pdf
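The missing zeroing can be illustrated with a toy model. This is a plain Python stand-in for shared device memory, not real driver behavior; the class and method names are hypothetical.

```python
class DevicePool:
    """Toy stand-in for device memory shared by several applications."""

    def __init__(self, size):
        self.mem = bytearray(size)

    def write(self, offset, data):
        self.mem[offset:offset + len(data)] = data

    def read(self, offset, nbytes):
        return bytes(self.mem[offset:offset + nbytes])

    def release(self, offset, nbytes, scrub):
        # Without scrubbing, the old contents stay readable by whoever
        # is handed this region next.
        if scrub:
            self.mem[offset:offset + nbytes] = bytes(nbytes)


pool = DevicePool(64)
pool.write(0, b"secret-key")
pool.release(0, 10, scrub=False)
assert pool.read(0, 10) == b"secret-key"   # next tenant sees stale data

pool.release(0, 10, scrub=True)
assert pool.read(0, 10) == bytes(10)       # zeroed before reuse
```

Scrubbing on release (or on allocation) closes this particular leak at the cost of an extra pass over the memory.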


I'm aware of the challenges, having worked in the GPU-on-client space for years. But they can all be overcome. There's no good reason for the status quo other than plain inertia.

Apps barely even use SIMD, even when it would clearly be beneficial.

> The other killer is launch overhead. Marshalling data into an appropriate structure, copying it over to the GPU, and copying it back takes a significant amount of time in many cases, independent of computation. This is really only viable when there is a significant amount of computation to be done, which inherently limits the number of applicable tasks.

This isn't that much of a problem in practice, because apps get slow precisely when there's a lot of computation to be done. Accelerating apps that are already fast isn't that interesting.

Do apps get slow enough for users to complain about? You bet they do…

> Let's say you accelerated something like browser DOM processing... imagine a malicious application being able to sniff that even after the fact.

This is bogus. Apps that can read framebuffer memory of the browser have already owned the browser for all intents and purposes. Taking screenshots of the browser is game over for privacy to begin with. And because all major browsers on popular OSes composite using the GPU already, there's no more attack surface than there already was.


The increase in attack surface is that you weren't running GPGPU apps with potential access to the browser framebuffer before.

Yes, the vulnerability was always there, but unless you were doing bitcoin mining or something else then no applications would have been emplaced to exploit it.

I edited in a couple papers I dug up on various exploits that have been published using GPUs.


> The increase in attack surface is that you weren't running GPGPU apps with potential access to the browser framebuffer before.

If I'm on desktop, I'm literally running apps that don't have to read the browser framebuffer: they can just ptrace() the browser and pull its memory out.

If I'm on mobile, then apps can't ptrace one another, but they do have full access to the GPU already, including GPGPU. For example, Apple's Metal has compute functionality.

Can you describe the specific attack scenario you had in mind?


- Allowing an unprivileged/sandboxed/jailed process to sniff internal state/framebuffer of a privileged process.

- Allowing one user to sniff internal state/framebuffer of another running on the same host/VM.

- If full write access to one of the global/shared memory spaces of a live process can be obtained, potentially breaking VM sandboxing or jumping privilege followed by buffer overflow or other exploits.

I'm not sure why you think the existence of other attack surfaces matters at all. If you get right down to it X11 doesn't respect any boundaries at all and you can just sniff or keylog anything you want.

That's still not a good reason to throw up your hands and declare you're not going to have any security at all.

This is a relatively general problem with letting untrusted applications run GPGPU acceleration, and there are some relatively simple mitigations.

What exactly is your beef with zeroing memory before you reallocate it to another application? Or are you just being pedantic?


The "third killer" is parallelism. Not all workloads have available parallelism to make the GPU effective - and some have the parallelism only in theory. I work in regular expression acceleration (obplug: Hyperscan) and have done regex-on-GPU in the past. One of the problems was that the easiest way to get parallelism for regex was to throw a lot of data at the problem, so you could have your 16K CUDA threads. However, hitting peak throughput performance with 2MB input buffers is a bit of a problem if your typical workload is 64-1500 bytes and you want the results as quickly as possible.

The launch overhead issue is also a real thing, despite some of the protestations in responses that it's not. The problem is that some computationally expensive operations are expensive and launched frequently from something non-GPU. So just because you 'have a lot of accelerable work to do in aggregate' doesn't mean that it will be structured nicely to accelerate. We had that problem as well; you might have tons of overall work that all parallelizes quite nicely in aggregate, but it's dribbling in from the network card or CPU in little units.
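To put numbers on the batching problem: a small sketch of how many 64-1500 byte inputs it takes to fill a 2 MB buffer, and how long the first input waits while the buffer fills. The 2 MB and 64-1500 byte figures come from the comment above; the arrival rate (1 Gbit/s, i.e. 125e6 bytes/s) is an assumption for illustration.

```python
import math

BUFFER_BYTES = 2 * 1024 * 1024   # the 2 MB input buffer


def packets_to_fill(packet_bytes, buffer_bytes=BUFFER_BYTES):
    """Number of inputs of a given size needed to fill one buffer."""
    return math.ceil(buffer_bytes / packet_bytes)


def fill_time_s(arrival_bytes_per_s, buffer_bytes=BUFFER_BYTES):
    """How long the first input waits before the batch can launch."""
    return buffer_bytes / arrival_bytes_per_s


assert packets_to_fill(64) == 32768
assert packets_to_fill(1500) == 1399
# At an assumed 125e6 bytes/s of incoming data, roughly 16.8 ms pass
# before a full batch exists, before any GPU work has even started.
print(f"{fill_time_s(125e6) * 1e3:.1f} ms")
```

So the batch that gives peak throughput also adds milliseconds of queueing latency to inputs that would each take microseconds to scan on a CPU.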


Regex acceleration is hard for sure when most regexes are small.

Consider where I'm coming from, though: apps don't use the GPU for simple 2D graphics (think rectangles and images) and barely ever use it for compositing, which are cases in which the GPU obviously performs very well. Sure, it's hard to do hard things on GPU. But apps don't even do easy things on GPU.


Upvote for the term blitter which reminds me of my Amiga and Atari days...


For the uninformed...

https://en.wikipedia.org/wiki/Blitter


Not that I disagree, but to be fair, query processing on GPUs is a highly specialized task, and not quite what people think when they say "general purpose". A lot of work goes into writing code that utilizes that particular architecture well.


Because of the memory ceiling it doesn't scale well for lots of concurrent users either...

"MapD can scale to many simultaneous users. If the users are just executing hand-written SQL queries, this could be hundreds of users per server. If the users are analysts working with MapD Immerse...then you can expect tens of simultaneous users without experiencing any performance degradation."

There are similar limitations around update, insert, etc.

Not disputing the benefits, but it just isn't general purpose. It is for a relatively small number of users querying a huge dataset.


This is an issue even with non-GPU-based OLAP engines. See, e.g., Clickhouse (https://clickhouse.yandex/reference_en.html):

"We'll say that the following is true for the OLAP (online analytical processing) scenario...Queries are relatively rare (usually hundreds of queries per server or less per second)."


Apples and oranges measures of users vs query rate, but it still sounds like conventional OLAP can handle a lot more.


Possibly, but I suspect that by "users," MapD really means "query rate." I can't see how a larger number of users would have a performance impact if all other variables were constant (data, query, query rate, etc.).


It's an in-memory database. "Huge dataset" is questionable in this scenario as it has to fit in RAM.


GPUs aren't good for everything. The right thing to do is balance between CPU and GPU code, and only do what actually gives any benefit.

For example, GPUs suck at working on text, or when there is a lot of divergence between threads.

For example, if some threads in a warp branch to the TRUE side of the IF condition and others branch to FALSE, the hardware runs both sides one after the other, and you pay a huge performance penalty.
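The divergence penalty can be sketched with a toy model of lockstep execution. This assumes a simplified SIMT picture (a group of threads, a warp in CUDA terms, shares one instruction stream, and each side of a branch that any thread takes is executed once with the other threads masked off). The function name and cycle costs are hypothetical.

```python
def warp_cycles(predicates, true_cost, false_cost):
    """Cycles for one lockstep group executing an if/else.

    predicates: one boolean per thread (which side it takes).
    Simplified model: every side taken by at least one thread is
    executed once for the whole group.
    """
    cycles = 0
    if any(predicates):          # someone takes the TRUE side
        cycles += true_cost
    if not all(predicates):      # someone takes the FALSE side
        cycles += false_cost
    return cycles


uniform = warp_cycles([True] * 32, true_cost=100, false_cost=100)
diverged = warp_cycles([True] * 16 + [False] * 16, true_cost=100, false_cost=100)
assert uniform == 100
assert diverged == 200   # both sides are serialized
```

In this model a single thread disagreeing with the other 31 costs as much as an even split, which is why data-dependent branching hurts so much.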


How is a query going to be faster on a GPU, when a query just performs lookups in tables and simple comparison operations? You'd have to get the data via the CPU to the GPU in the first place, so you might as well perform the lookup/comparison on the CPU instead.

Unless of course you plan to load the complete database in GPU memory, which makes the whole approach unsuited for large databases.


This is for analytics workloads, where you stream rows through the GPU.


To be honest, although GPUs are super cool, I have to admit a bit of disappointment in practice. I don't know about databases, but in the machine learning community one usually sees about 6x to 20x speedup on GPU. This is amazing, of course, but far below what we, way back, naively thought we'd be able to get from GPUs -- i.e., 1000 processors instead of 1, we should get a 1000x speed-up, right? Not so at all. So many architectural issues (i/o, memory, even sometimes algorithmic limitations) get in the way of achieving true parallel linear performance.


GPUs typically clock at roughly half of what a regular desktop CPU will do, so there's a factor of two right there. Furthermore the data has to pass through another bottleneck (twice!), the PCIe bus. Finally, you need to keep in mind that all this is subject to Amdahl's law: you still have a sequential component that you won't be able to speed up at all.

Also, that 6 to 20x speedup is what you get on the commodity GPUs, the ones sold at a higher mark-up are quite a bit faster than that due to better comms options and more RAM.


Eh, Amdahl's law usually isn't all too applicable to highly parallelizable tasks, check out Gustafson's Law for a more modern perspective on the parallelization paradox, courtesy of supercomputers being forced to parallelize.
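Both laws fit in a couple of lines, which makes the difference easy to see. A minimal sketch; the 95% parallel fraction and 1000 processors are assumed figures for illustration, not measurements.

```python
def amdahl_speedup(parallel_fraction, n):
    """Fixed problem size: parallel_fraction is the share of the
    single-processor runtime that can be parallelized."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def gustafson_speedup(parallel_fraction, n):
    """Problem size grows with n: parallel_fraction is the share of the
    runtime on the parallel machine spent in parallel work."""
    return (1.0 - parallel_fraction) + parallel_fraction * n


p, n = 0.95, 1000
print(f"Amdahl:    {amdahl_speedup(p, n):.1f}x")     # about 19.6x
print(f"Gustafson: {gustafson_speedup(p, n):.1f}x")  # about 950x
```

Note that the two fractions are measured against different baselines, so the results are not directly comparable. Amdahl asks how much faster a fixed job gets; Gustafson asks how much more work fits in the same time.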


Each CUDA core can tackle approximately one FLOP per cycle, while a CPU core can tackle approximately 16 FLOPs per cycle. Given that there are usually at least 6 cores on a chip, and that reaching full utilization is difficult on a GPU, I don't think ~20X is too shabby.
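That arithmetic, as a quick sketch. The per-cycle figures (1 FLOP per CUDA core, 16 per CPU core) and the 6 CPU cores come from the comment; the GPU core count and both clock speeds are hypothetical round numbers, so the resulting ratio is illustrative only.

```python
def peak_flops(cores, flops_per_cycle, clock_hz):
    """Theoretical peak, assuming every core is busy every cycle."""
    return cores * flops_per_cycle * clock_hz


# Hypothetical hardware figures, chosen only to illustrate the ratio.
gpu = peak_flops(cores=3000, flops_per_cycle=1, clock_hz=1.0e9)
cpu = peak_flops(cores=6, flops_per_cycle=16, clock_hz=3.0e9)

ratio = gpu / cpu
print(f"peak GPU/CPU ratio: {ratio:.1f}x")   # roughly 10x, not 3000x
```

With peak ratios in this range, a measured speedup near 20x would suggest the CPU baseline was itself far from its peak.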

Also more importantly, the efficiency goes up dramatically, which is usually infinitely more important than the raw performance.


You can also throw extra GPUs at the PCI bus, not so with extra processors.


No GPU-accelerated search engine yet?


While it isn't a GPU exactly, Bing is built on top of FPGAs. https://blogs.microsoft.com/next/2016/10/17/the_moonshot_tha...


Would be pretty expensive at scale, and much has been done to optimize CPU-only search.


Google uses neural nets extensively for search, and those run on TPUs. Bing uses FPGAs for acceleration, presumably also for neural networks.

Both are more efficient than a gpu.


That is for inference. The neural networks are trained using GPUs.


Well yes, but I would assume that this case would be for inference, since models are pretty much train once and deploy everywhere (granted, hyperparameter tuning is still a point).




