The latency to and from GPUs is awful. I've been hacking GPGPU since before CUDA existed (I worked with an early beta of CUDA and the GeForce 8800 GTX). You can do some great throughput-oriented stuff there, but the latency issues meant that GPUs were useless for small-to-medium tasks. It's partly a latency problem and partly a problem of getting enough data to make parallelism useful - we had a pattern-matching task that hit peak throughput at about 16K threads, which of course required a big pile of data.
Things may have improved on the latency front since then, but back then we were multiple orders of magnitude off for network-processing tasks, let alone kernel tasks. And the need for 'big piles of data' to work over isn't going away. It's algorithm-dependent, of course, but lots of data is the easy way to find data-parallelism. :-)
Most of the papers published about GPGPU either use large data sizes (where GPGPU is a legitimate tactic) or do some really serious handwaving about latency (e.g. the GPU routing stuff, PacketShader). Just because a bunch of people can get interesting papers out of it doesn't mean it's a good idea.
Perhaps memory-page deduplication candidate detection could run out there. It would be memory-bound, but by not polluting the CPU cache it might be a win. (This matters for systems running lots of virtual machines.)
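A minimal sketch of what that could look like (hypothetical code; FNV-1a as a stand-in hash): one GPU thread hashes one 4 KiB page, and the host then byte-compares only the pages whose hashes collide.

```
// Hypothetical CUDA sketch: one thread hashes one 4 KiB page with FNV-1a.
// A weak hash is fine here; it only nominates candidates, and the host
// still byte-compares pages before actually merging them.
#include <cstdint>

constexpr int PAGE_BYTES = 4096;
constexpr int PAGE_WORDS = PAGE_BYTES / 8;

__global__ void hash_pages(const uint64_t* pages, uint64_t* hashes, int n_pages)
{
    int page = blockIdx.x * blockDim.x + threadIdx.x;
    if (page >= n_pages) return;

    uint64_t h = 1469598103934665603ull;      // FNV-1a offset basis
    const uint64_t* p = pages + (size_t)page * PAGE_WORDS;
    for (int i = 0; i < PAGE_WORDS; ++i) {
        h ^= p[i];
        h *= 1099511628211ull;                // FNV-1a prime
    }
    hashes[page] = h;
}
```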
I heard that Tsubame, a supercomputer built with NVIDIA GPUs, computed ECC over its GPU-side memory in GPU code, because those GPUs were consumer-grade and lacked hardware ECC.
This was really non-obvious to me.
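For flavor, a much-simplified sketch of the idea (hypothetical code, detection only; the real Tsubame scheme, and software ECC generally, e.g. Hamming SECDED, also corrects errors): keep a parity word per chunk of GPU memory and re-check it in a kernel before trusting a read.

```
// Hypothetical sketch: one thread re-checks the XOR parity of one 64-word
// chunk of GPU memory. Detection only; a real SECDED code would also
// locate and correct single-bit flips.
#include <cstdint>

__global__ void check_parity(const uint64_t* data, const uint64_t* parity,
                             int* error_flag, int n_chunks)
{
    int chunk = blockIdx.x * blockDim.x + threadIdx.x;
    if (chunk >= n_chunks) return;

    uint64_t p = 0;
    for (int i = 0; i < 64; ++i)
        p ^= data[(size_t)chunk * 64 + i];   // XOR-fold the chunk

    if (p != parity[chunk])
        atomicExch(error_flag, 1);           // some bit flipped in this chunk
}
```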
Being the most compelling 'software router' is sort of like being the 'tallest midget', but even in this domain I think the claimed advantages over CPU-only approaches are mainly due to carefully massaging the presentation of the data.
Multi-touch gestures and handwriting recognition.
Oh, and simple XOR? Well, assuming the data is already in L2, an Intel i7 can XOR 10+ GB/s on just a single core, across three buffers, i.e. the minimum for RAID 5. The fastest RAID adapters achieve only a fraction of that speed.
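For reference, the kernel of that claim is a trivial loop (hypothetical buffers; three buffers - two data, one parity - is the minimum RAID 5 stripe), and any modern compiler will auto-vectorize it:

```
// Host-side sketch: RAID-5 parity is just an XOR across the stripe.
// With a, b, and parity resident in cache, a single core streams this
// at many GB/s.
#include <cstddef>
#include <cstdint>

void raid5_parity(const uint64_t* a, const uint64_t* b,
                  uint64_t* parity, size_t n_words)
{
    for (size_t i = 0; i < n_words; ++i)
        parity[i] = a[i] ^ b[i];
}
```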
Maybe block deduplication could be done this way. If a block is a dupe, skipping its allocation on disk (saving at least one block write) could offset a lot of block-hash calculations.
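A host-side sketch of that flow, with hypothetical names throughout (a real system would confirm a hash match with a byte compare, or use a cryptographic hash, before trusting it):

```
#include <cstddef>
#include <cstdint>
#include <unordered_map>

constexpr size_t BLOCK_BYTES = 4096;

// Hypothetical: maps block hash -> on-disk location of the first copy.
std::unordered_map<uint64_t, uint64_t> seen;

uint64_t fnv1a(const uint8_t* p, size_t n)
{
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

// Returns true if the block must actually be written; false if it can be
// deduplicated against the copy at *existing_lba, saving the block write.
bool should_write(const uint8_t* block, uint64_t new_lba, uint64_t* existing_lba)
{
    uint64_t h = fnv1a(block, BLOCK_BYTES);
    auto it = seen.find(h);
    if (it != seen.end()) { *existing_lba = it->second; return false; }
    seen.emplace(h, new_lba);
    return true;
}
```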
Here are some more examples:
There are many, many applications beyond crypto.
Of course, had I learned how the graphics rendering pipeline works, it would have been obvious, but that only came later.
After that, I've always wondered how feasible it would be to write an OS for regular users (there are CUDA supercomputers, but that isn't very representative of how most people use computers) that uses the GPU for computations other than graphics. Hopefully this project will shed some light in that direction.
So for a proof of concept it makes sense to go with CUDA so you can play with the algorithms and not worry about all the other stuff.
It's interesting, but the applications for this are seriously limited and very specific.
Graphics apps already use the GPU. The OS mostly keeps data structures and calls functions.
I think people often forget to factor in data-transfer time. Matrix multiplication is 98x faster on my GPU than on my CPU, but I don't actually break even on real-world time until the dimensions are up around 2000 or so.
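A back-of-envelope model shows why (assumed throughput numbers, not measurements): GEMM compute grows as n^3 while PCIe traffic grows only as n^2, so below some n the copies dominate.

```
// Rough model, hypothetical numbers: compare GEMM compute time with the
// time to move A, B, and C over PCIe.
#include <cstdio>

int main()
{
    const double gpu_flops = 1e12;  // assumed: 1 TFLOP/s sustained GEMM
    const double pcie_bw   = 8e9;   // assumed: 8 GB/s effective transfer

    for (int n = 500; n <= 4000; n += 500) {
        double compute  = 2.0 * n * n * n / gpu_flops;   // seconds
        double transfer = 3.0 * n * n * 4 / pcie_bw;     // 3 float matrices
        printf("n=%4d  compute=%.4f s  transfer=%.4f s\n", n, compute, transfer);
    }
    return 0;
}
```

With these particular assumptions the copy dominates below roughly n = 750; the exact break-even against a CPU depends on the CPU's own GEMM speed, but the scaling argument is the same.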
That's the entire point: there is no usefulness to this outside of very specific applications.
Regarding security: these days, GPUs have memory isolation and protection. This is how the drivers work: each process gets its own memory-mapped command queue, and permission to use only the part of GPU memory that has been allocated to it.
So although it does increase the attack surface to GPU+CPU instead of just CPU, it is not a matter of 'simply stuffing everything into the privileged domain'.