KGPU — Accelerating Linux Kernel Functions with CUDA (code.google.com)
97 points by Tsiolkovsky on May 6, 2011 | 34 comments

This is almost certainly nuts.

The latency to and from GPUs is awful. I've been hacking GPGPU since before CUDA existed (worked with an early beta of CUDA and the GTX 8800). You can do some great throughput-oriented stuff there, but the latency issues meant that GPUs were useless for small-to-medium tasks. It's partly an issue of latency and also an issue of getting 'enough data to make parallelism useful' - we had a pattern matching task that hit peak throughput at about 16K threads which required, of course, a big pile of data.

Things may have improved in terms of latency since then, but we're talking multiple orders of magnitude off (back then) for network processing tasks, much less kernel tasks. And the issue of needing 'big piles of data' to work over isn't going away. This is algorithm-dependent, of course, but lots of data is the easy way to find data-parallelism. :-)

Most of the papers published about GPGPU are either larger data sizes (where GPGPU is a legitimate tactic) or do some really serious handwaving concerning latency (e.g. the GPU routing stuff, 'packetshader'). Just because there are a bunch of people who can get interesting papers out of it doesn't mean it's a good idea.

Sandia uses GPUs for calculating parity in software RAID, http://www.computer.org/portal/web/csdl/doi/10.1109/ICPP.201...

Well, other than crypto what is there really to do that is more efficient in the GPU?

I don't think it needs to be more efficient than the CPU to merit moving to the GPU. If the horsing around to get the data in and out is less work than doing the job, then you may as well put the GPU to work and improve your total throughput.

Perhaps the memory page deduplication candidate detection could run out there. It would be memory bound, but maybe by not ruining the CPU cache it would be a win. (This is important for systems running a bunch of virtual machines.)

"the horsing around to get the data in and out" seems to be the key factor. An analysis of BLAS libraries' performance across several architectures [1] showed that GPU-based calculation only approached implementations like Goto BLAS with matrix dimensions well up into the thousands. That's just one example, but there seems to be a fair bit of overhead in getting the data to and from the GPU.

[1] http://dirk.eddelbuettel.com/blog/code/gcbd/

Calculating error correction code, though efficiency depends on the memory architecture.

I heard Tsubame, a supercomputer built with NVIDIA GPUs, calculated ECC on its GPU-side memory with GPU code because those GPUs were consumer grade and didn't have hardware ECC.

Routing: http://shader.kaist.edu/packetshader/

This was really non-obvious to me.

It's non-obvious because it's a bad idea. This paper comes from a wacky world where latency and power consumption don't matter. The comparisons between CPU vs. GPU aren't that compelling just on the surface of it. The latency/power consumption numbers (compared to dedicated ASICs for this sort of thing) are just laughable.

Being the most compelling 'software router' is sort of like being the 'tallest midget' but even in this domain, I think their alleged advantages over CPU-only are mainly due to carefully massaging the presentation of the data.

OCR (optical character recognition). Picture recognition - face detection. Speech recognition. Speech synthesis. Video recognition.

Multi touch gestures and handwriting recognition.

Phiber Optik (Mark Abene) had a pretty interesting talk yesterday at NY Hacker about using CUDA for intrusion detection calculations.

RAID checksums computation looks like an obvious possibility. We'd need a battery backup for the VRAM, too :)

A single core can hash (checksum) 5 GB/s using murmurhash. The data you checksum is probably already in L1/L2 cache (write to RAID) or is going to be used by userland, in which case our reading it just means the userland process gets its data from cache instead (read from RAID). You can get maybe 2-6 GB/s to the GPU. Add the latency (sync, etc.) and the GPU time to calculate the hash, and you've probably radically slowed down the process. Additionally, assuming DMA transfer, your memory subsystem is more stressed because both the CPU and the GPU read the same data.
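A rough sketch of that arithmetic, for anyone who wants to plug in their own numbers. The 5 GB/s CPU figure and the 2-6 GB/s link range come from the comment above; the GPU hashing rate, the sync latency, and the 1 MiB buffer size are made-up placeholders, and the model assumes copy and compute do not overlap.

```python
def offload_throughput_gbps(size_bytes, link_gbps, gpu_gbps, sync_latency_s):
    """Effective GB/s when a buffer is copied to the GPU, hashed there, and synced.

    Serial model: sync latency + transfer time + GPU compute time, no overlap.
    """
    gb = size_bytes / 1e9
    total_s = sync_latency_s + gb / link_gbps + gb / gpu_gbps
    return gb / total_s


CPU_HASH_GBPS = 5.0        # figure quoted in the comment above
ASSUMED_GPU_GBPS = 50.0    # hypothetical GPU hashing rate
ASSUMED_SYNC_S = 20e-6     # hypothetical launch/sync latency
BUFFER_BYTES = 1 << 20     # hypothetical 1 MiB buffer

for link_gbps in (2.0, 6.0):  # low and high end of the quoted link range
    rate = offload_throughput_gbps(
        BUFFER_BYTES, link_gbps, ASSUMED_GPU_GBPS, ASSUMED_SYNC_S
    )
    verdict = "slower" if rate < CPU_HASH_GBPS else "faster"
    print(f"link {link_gbps} GB/s -> {rate:.2f} GB/s effective ({verdict} than CPU)")
```

Whatever the placeholders are, the effective rate in this model can never exceed the link rate, so a 2 GB/s link loses to a 5 GB/s core no matter how fast the GPU hashes.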

Oh, and simple xor? Well, assuming the data is in L2 already, an Intel i7 can xor 10+ GB/s using just a single core, across 3 buffers, i.e. the minimum for RAID 5. The fastest RAID adapters can achieve only a fraction of that speed.
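For reference, the xor in question is tiny. Here is a minimal sketch of RAID 5 style parity over three buffers (two data, one parity); the function name and the sample blocks are made up for illustration, and a real implementation would use SIMD in C rather than Python integers.

```python
def xor_parity(*blocks: bytes) -> bytes:
    """Return the byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    acc = 0
    for b in blocks:
        # Treat each block as one big integer so XOR covers every byte at once.
        acc ^= int.from_bytes(b, "little")
    return acc.to_bytes(length, "little")


d0 = bytes(range(16))          # hypothetical data block 0
d1 = bytes([0xAA] * 16)        # hypothetical data block 1
parity = xor_parity(d0, d1)    # third buffer: the parity block

# Losing d1 is recoverable from the surviving data block and the parity.
rebuilt_d1 = xor_parity(d0, parity)
print(rebuilt_d1 == d1)
```

The operation is a single pass over memory with almost no arithmetic, which is why it tends to be limited by memory bandwidth rather than by compute.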

I think this is very memory intensive. Remember the GPU would have to calculate block checksums preferably from the main memory where the buffer resides.

Maybe block deduplication could be done this way. If the block is a dupe, skipping its allocation on the disk (it would save at least one block write) could offset a lot of block hash calculations.
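Roughly what I mean, as a userspace sketch rather than anything kernel-shaped. The dict standing in for the on-disk block store, the function name, and the choice of SHA-256 are all assumptions for illustration; the hashing loop is the part one might imagine batching onto a GPU.

```python
import hashlib


def write_with_dedup(blocks, store):
    """Store each block under its digest and skip writes for duplicates.

    `store` is a dict mapping digest -> block, a stand-in for the disk.
    Returns the list of digests (one per input block) and the number of
    block writes that were skipped.
    """
    refs = []
    skipped = 0
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest in store:
            skipped += 1          # duplicate: no allocation, no write
        else:
            store[digest] = block
        refs.append(digest)
    return refs, skipped


store = {}
incoming = [b"a" * 4096, b"b" * 4096, b"a" * 4096]  # hypothetical 4 KiB blocks
refs, skipped = write_with_dedup(incoming, store)
print(len(store), skipped)
```

A real implementation would probably also compare the bytes on a digest match before trusting it, which this sketch leaves out.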

Working with polynomials. That's important in computational geometry.


Calculating Viterbi paths for Hidden Markov Models is faster by an order of magnitude or two than doing it on the CPU. I worked on porting NVIDIA's OpenCL implementation to a more 'platform neutral' version for the research project I'm involved in.
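For anyone who hasn't seen it, here is a plain CPU sketch of the Viterbi recurrence. This is not the NVIDIA code or my port; the function signature and the two-state toy model are assumptions for illustration only. The inner loop over states is independent per state at each time step, which is where the data parallelism comes from.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (most likely state path, its probability) for an observation list."""
    # best[t][s] = (probability of the best path ending in s at time t, predecessor)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = best[-1]
        row = {}
        for s in states:
            # Each state's max over predecessors is independent of the others.
            p, arg = max((prev[r][0] * trans_p[r][s], r) for r in states)
            row[s] = (p * emit_p[s][o], arg)
        best.append(row)
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    # Walk the predecessor pointers backwards to recover the path.
    path = [last]
    for row in reversed(best[1:]):
        path.append(row[path[-1]][1])
    path.reverse()
    return path, prob


# Toy model with illustrative numbers.
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}
path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)
```

With only two states there is nothing to parallelize, of course; the speedups show up when the state space is large.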

Here are some more examples:


There are many, many applications beyond crypto.

I think the question is about how the kernel can use the GPU. Linux probably doesn't need to train hidden Markov models. It might, however, need to do crypto (e.g., for an encrypted filesystem).

Oh gosh you're right. I wasn't thinking about the context in which the question was posed. Anyway, hopefully someone will find those examples interesting. NVIDIA's CUDA developer zone is chock full of great resources for GPGPU (like video lectures and tools and code examples).

Ever since I first heard about CUDA and ran the N-body simulation with hundreds of bodies in real time on a cheap GeForce 8500, I have been aware of how good these things are as parallel processors.

Of course, had I learned about how the rendering pipeline for graphics works it would have been obvious, but that only came later.

After that, I've always wondered how feasible it would be to write an OS for regular users (there are CUDA supercomputers, but that isn't very representative of how most people make use of computers) that makes use of the GPU for various computations other than graphics. Hopefully, this project will shed some light in that direction.

What are the differences between using CUDA and OpenCL in this scenario? I was under the impression CUDA is Nvidia-only.

Yeah, basically OpenCL is a standardized version of CUDA.

CUDA has a much simpler API (and your entire program can be written in CUDA C) because it makes assumptions about the hardware you're using. OpenCL is for 'computing on heterogeneous platforms' meaning that if you do it right, your code can run on multithreaded CPUs, NVIDIA GPUs, ATI GPUs, or any device that conforms to the OpenCL standard. The tradeoff is that you have to write a lot of accessory code to make sure you're computing on the right device and conforming to the capabilities of that device... and these decisions have to be made at runtime unless you specifically know the platform and devices attached to the host but then your code isn't very portable.

So for a proof of concept it makes sense to go with CUDA so you can play with the algorithms and not worry about all the other stuff.

Matrix multiplication is the only thing GPUs are useful for... Basically, parallel mathematical operations.

It's interesting, but the applications to this are seriously limited and very specific.

Graphic apps already use the GPU. The OS mostly keeps data structures and calls functions.

That may be a bit harsh. If your input size is sufficiently large, I would bet that encryption would get a pretty big boost. I know convolution operations are a lot faster (not sure where this would happen in a kernel though).

I think people often forget to factor in data transfer time. Matrix multiplication is 98x faster on my gpu than on my cpu, but I don't actually break even on real world time until the dimensions are up around 2000 or so.
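A back-of-the-envelope model of that break-even point, as a sketch. Only the 98x speedup comes from my measurement above; the CPU rate, link bandwidth, and fixed overhead are placeholder assumptions, so don't expect it to reproduce the ~2000 figure for any particular machine. It counts 2n^3 flops for an n x n multiply and three n x n float32 matrices crossing the bus.

```python
def cpu_seconds(n, cpu_flops):
    """Time for an n x n matrix multiply on the CPU (2*n^3 flops)."""
    return 2 * n**3 / cpu_flops


def gpu_seconds(n, cpu_flops, speedup, link_bytes_per_s, overhead_s, elem_bytes=4):
    """GPU wall time: fixed overhead + transfers + (CPU compute time / speedup)."""
    compute = cpu_seconds(n, cpu_flops) / speedup
    transfer = 3 * n * n * elem_bytes / link_bytes_per_s  # A and B in, C out
    return overhead_s + transfer + compute


def break_even_dim(cpu_flops, speedup, link_bytes_per_s, overhead_s):
    """Smallest n at which the GPU wins on wall-clock time."""
    if speedup <= 1:
        raise ValueError("the GPU never wins unless speedup > 1")
    n = 1
    while gpu_seconds(n, cpu_flops, speedup, link_bytes_per_s, overhead_s) >= cpu_seconds(
        n, cpu_flops
    ):
        n += 1
    return n


# Hypothetical hardware numbers, except for the 98x speedup quoted above.
n_star = break_even_dim(
    cpu_flops=10e9, speedup=98.0, link_bytes_per_s=5e9, overhead_s=1e-3
)
print(n_star)
```

Raising the fixed overhead or lowering the link bandwidth pushes the break-even dimension up quickly, which is the whole point: raw kernel speedup tells you very little on its own.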

True. A talk I recently attended quoted results of a Gammatone filterbank implementation for audio processing using CUDA (which was properly parallelized) and compared it with IPP (Intel Integrated Performance Primitives) on the CPU. The IPP version turned out to be a lot faster because of the cost of transferring the data to the GPU in the CUDA implementation.

That's not true. You can do all kinds of things related to polynomials on GPUs with nearly optimal speedup.


...which the kernel and OS have no need for.

That's the entire point, that there is no usefulness to this outside of very specific applications.

Honestly I don't like Linux, FreeBSD, Windows, etc., where everything under the sun is stuffed into a privileged domain and is stagnating to the point where hardware common 10 years ago is just becoming useful w.r.t. the kernel. I can only hope that something new will come along and simply multiplex the hardware and have abstractions as a 'library' without having to hack through n subsystems to add feature x.

What's your plan for helping with this effort?

Eventually finish working on a programming language that can facilitate this. Probably create something similar to MIT's old exokernel research project.

Internet tough programmer?

Ever since 2007/2008, GPUs and CPUs have been converging, and it is inevitable that at a certain moment even some kernel computations are best left to the GPU.

Regarding security, these days, GPUs have memory isolation and protection. This is how the drivers work: they give a process its own memory-mapped command queue, and only permissions to use part of the GPU memory which they've allocated.

So although it does increase the attack surface to GPU+CPU instead of just CPU, it is not a matter of 'simply stuffing everything into the privileged domain'.

Something new came along 20 years ago with microkernels.

But wouldn't it be beautiful if you could abstract away the hardware so that the drivers can be uniform? I call it the driver-driver.

No it wouldn't. That way lies "one size fits nobody very well" computing. Cross platform app users already suffer a lot from that problem.
