"Accelerating fully homomorphic encryption using GPU"
Very nice. This looks interesting. Even though I probably won't understand it, at least it means I wasn't that far off.
Hopefully we will see more of this sooner rather than later. FHE is one of those things that I think would be a game changer if it's cheap enough to use broadly.
Fully Homomorphic Encryption
CUDA is the most commonly used compute language, but it's NVIDIA-only. OpenCL is "universal" (sort of), but because it's a general-purpose framework for heterogeneous compute rather than a task-specific GPGPU language, it's much more complex and lacks the libraries and tooling that are available in CUDA. NVIDIA's OpenCL performance is all over the map: sometimes good, sometimes fairly terrible (half of CUDA performance or less). Users without a supported GPU have nothing at all.
So inherently - any GPU acceleration targets only a subset of your users.
The other killer is launch overhead. Marshalling data into an appropriate structure, copying it over to the GPU, and copying it back takes a significant amount of time in many cases, independent of computation. This is really only viable when there is a significant amount of computation to be done, which inherently limits the number of applicable tasks.
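To put rough numbers on that, here is a back-of-the-envelope sketch. Every figure in it (the 20 µs launch cost, the 10 GB/s transfer rate, the 10x speedup) is a made-up placeholder, not a measurement, and the function names are hypothetical; plug in values from your own hardware.

```python
def gpu_total_seconds(cpu_seconds, payload_bytes, speedup=10.0,
                      launch_seconds=20e-6, bytes_per_second=10e9):
    """Estimate wall time for offloading one task to the GPU.

    All defaults are illustrative assumptions, not benchmarks.
    """
    # Data is copied to the device and the result copied back.
    copy_seconds = 2 * payload_bytes / bytes_per_second
    # The compute itself is assumed to run `speedup` times faster.
    return launch_seconds + copy_seconds + cpu_seconds / speedup


def worth_offloading(cpu_seconds, payload_bytes, **kwargs):
    """True when the estimated GPU time beats just staying on the CPU."""
    return gpu_total_seconds(cpu_seconds, payload_bytes, **kwargs) < cpu_seconds


if __name__ == "__main__":
    # A 5 microsecond task on 4 KiB: overhead dominates.
    print(worth_offloading(5e-6, 4096))
    # A 100 ms task on 64 MiB: compute dominates.
    print(worth_offloading(0.1, 64 * 1024 * 1024))
```

The point of the model is just that the fixed costs don't shrink with the task, so small tasks lose even with a large nominal speedup.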
In fact ideally you would have data living right on the GPU all the time and you just make queries against it. But then you run into memory limitations - most GPUs have ~4 GB of VRAM and that's only enough VRAM for a couple applications at most (remember, we need to do a lot of work which means a fairly large quantity of data). And also since you want memory allocations to be contiguous you tend to allocate aggressively, which really means only 1-2 applications can use it.
So practically speaking you now have a subset of your subset of users, the people who don't have another application running on their GPU at the same time.
You can see how all this pushes it towards very niche tasks, being performed one at a time.
The other approach is having an iGPU which can work in the same memory space as the CPU, like AMD does with HSA. That would lower the overhead quite a bit but right now that's a very limited subset of hardware.
Practically speaking, right now the most viable general approach for vector processing is SSE/AVX, and you can see how poor the uptake of AVX has been.
There are also major security concerns: GPUs aren't very conscious about things like memory segmentation or zeroing before re-allocation. As long as what you are doing is not blatantly illegal (writing to memory addresses that do not exist, etc.), it is pretty much just a single flat memory space that you can do anything with, and on non-ECC devices memory is only zeroed on a cold startup.
Think "microcontroller" here, not memory like you'd get in desktop-land.
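A toy model of the reallocation problem, in plain Python. This is not a real GPU API: `ToyDevicePool` and `leaked_bytes` are hypothetical names, and the "device memory" is just a `bytearray`. It only illustrates what a second tenant can see when a flat buffer is handed out again without being cleared.

```python
class ToyDevicePool:
    """Hypothetical stand-in for a flat device memory space."""

    def __init__(self, size, zero_on_alloc=False):
        self._memory = bytearray(size)
        self._zero_on_alloc = zero_on_alloc

    def alloc(self, length):
        """Hand out the first `length` bytes (a deliberately naive policy)."""
        if length > len(self._memory):
            raise MemoryError("toy pool exhausted")
        region = memoryview(self._memory)[:length]
        if self._zero_on_alloc:
            # The mitigation: scrub the region before a new owner sees it.
            region[:] = bytes(length)
        return region


def leaked_bytes(zero_on_alloc):
    """Return what a second allocation sees after the first wrote a secret."""
    pool = ToyDevicePool(64, zero_on_alloc=zero_on_alloc)
    victim = pool.alloc(16)
    victim[:6] = b"secret"
    victim.release()          # the victim "frees" its buffer
    attacker = pool.alloc(16)
    return bytes(attacker[:6])


if __name__ == "__main__":
    print(leaked_bytes(zero_on_alloc=False))
    print(leaked_bytes(zero_on_alloc=True))
```

Zeroing on allocation costs a memset per allocation, which is the trade-off being argued about here.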
Apps barely even use SIMD, even when it would clearly be beneficial.
> The other killer is launch overhead. Marshalling data into an appropriate structure, copying it over to the GPU, and copying it back takes a significant amount of time in many cases, independent of computation. This is really only viable when there is a significant amount of computation to be done, which inherently limits the number of applicable tasks.
This isn't that much of a problem in practice, because apps get slow precisely when there's a lot of computation to be done. Accelerating apps that are already fast isn't that interesting.
Do apps get slow enough for users to complain about? You bet they do…
> Let's say you accelerated something like browser DOM processing... imagine a malicious application being able to sniff that even after the fact.
This is bogus. Apps that can read the browser's framebuffer memory have already owned the browser for all intents and purposes. Taking screenshots of the browser is game over for privacy to begin with. And because all major browsers on popular OSes already composite using the GPU, there's no more attack surface than there already was.
Yes, the vulnerability was always there, but unless you were doing bitcoin mining or something similar, no application would have been in place to exploit it.
I edited in a couple papers I dug up on various exploits that have been published using GPUs.
If I'm on desktop, I'm literally running apps that don't have to read the browser framebuffer: they can just ptrace() the browser and pull its memory out.
If I'm on mobile, then apps can't ptrace one another, but they do have full access to the GPU already, including GPGPU. For example, Apple's Metal has compute functionality.
Can you describe the specific attack scenario you had in mind?
- Allowing one user to sniff internal state/framebuffer of another running on the same host/VM.
- If full write access to one of the global/shared memory spaces of a live process can be obtained, potentially breaking VM sandboxing or jumping privilege followed by buffer overflow or other exploits.
I'm not sure why you think the existence of other attack surfaces matters at all. If you get right down to it X11 doesn't respect any boundaries at all and you can just sniff or keylog anything you want.
That's still not a good reason to throw up your hands and declare you're not going to have any security at all.
This is a relatively general problem with letting untrusted applications run GPGPU acceleration, and there are some relatively simple mitigations.
What exactly is your beef with zeroing memory before you reallocate it to another application? Or are you just being pedantic?
The launch overhead issue is also a real thing, despite some of the protestations in responses that it's not. The problem is that some operations are computationally expensive and also launched frequently from something non-GPU. So just because you 'have a lot of accelerable work to do in aggregate' doesn't mean that it will be structured nicely to accelerate. We had that problem as well; you might have tons of overall work that all parallelizes quite nicely in aggregate, but it's dribbling in from the network card or CPU in little units.
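A rough way to see the "dribbling in" problem is to model batching. This is a sketch with invented numbers (20 µs per launch, 1 µs of GPU work per unit) and hypothetical function names; the shape of the trade-off is the point, not the values.

```python
import math


def total_gpu_seconds(units, batch_size, launch_seconds=20e-6, unit_seconds=1e-6):
    """Time to process `units` small work items, `batch_size` per launch.

    Defaults are illustrative assumptions, not measurements.
    """
    launches = math.ceil(units / batch_size)
    return launches * launch_seconds + units * unit_seconds


def worst_case_wait_seconds(batch_size, arrival_interval_seconds):
    """Extra latency the first item in a batch pays while the batch fills."""
    return (batch_size - 1) * arrival_interval_seconds


if __name__ == "__main__":
    for batch in (1, 10, 1000):
        print(batch, total_gpu_seconds(10_000, batch),
              worst_case_wait_seconds(batch, 10e-6))
```

Larger batches amortize the launch cost, but each item waits longer before it is even submitted, which is exactly why work arriving in small units from a NIC or CPU is awkward to accelerate.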
Consider where I'm coming from, though: apps don't use the GPU for simple 2D graphics (think rectangles and images) and barely ever use it for compositing, which are cases in which the GPU obviously performs very well. Sure, it's hard to do hard things on GPU. But apps don't even do easy things on GPU.
"MapD can scale to many simultaneous users. If the users are just executing hand-written SQL queries, this could be hundreds of users per server. If the users are analysts working with MapD Immerse...then you can expect tens of simultaneous users without experiencing any performance degradation."
There are similar limitations around update, insert, etc.
Not disputing the benefits, but it just isn't general purpose. It is for a relatively small number of users querying a huge dataset.
"We'll say that the following is true for the OLAP (online analytical processing) scenario...Queries are relatively rare (usually hundreds of queries per server or less per second)."
For example, GPUs suck at working on text, or when there is a lot of divergence between threads.
For example, if several threads in the same warp branch to the TRUE side of an IF condition and several branch to FALSE, the hardware runs both sides one after the other, and you're paying a huge performance penalty.
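A simplified cost model of that penalty, assuming lockstep (SIMT) execution where a group of threads steps through every branch that at least one of its threads takes. The cycle counts and the function name `warp_cycles` are hypothetical, and real hardware has more nuance than this.

```python
def warp_cycles(predicates, true_cost, false_cost):
    """Cycles one lockstep thread group spends on an if/else.

    `predicates` holds one boolean per thread. Each side of the branch is
    executed once if any thread needs it, with the other threads masked off.
    """
    cycles = 0
    if any(predicates):
        cycles += true_cost    # some thread takes the TRUE side
    if not all(predicates):
        cycles += false_cost   # some thread takes the FALSE side
    return cycles


if __name__ == "__main__":
    uniform = [True] * 32
    diverged = [i % 2 == 0 for i in range(32)]
    print(warp_cycles(uniform, true_cost=100, false_cost=100))
    print(warp_cycles(diverged, true_cost=100, false_cost=100))
```

In this model a single stray thread is enough to make the whole group pay for both sides, which is why sorting or partitioning work by branch outcome is a common workaround.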
Unless of course you plan to load the complete database in GPU memory, which makes the whole approach unsuited for large databases.
Also, that 6 to 20x speedup is what you get on commodity GPUs; the ones sold at a higher mark-up are quite a bit faster than that due to better comms options and more RAM.
Also more importantly, the efficiency goes up dramatically, which is usually infinitely more important than the raw performance.
