If I were going to write a microkernel OS today, and was going to rely entirely on the GPU for rendering, could I get away with putting the graphics stack entirely in user space, or would it still be noticeably slower?
On Linux, a good chunk of the graphics stack already resides in user space. The kernel provides only modesetting, GPU memory management, access multiplexing, and isolation.
Applications render by calling the OpenGL library, which of course executes in the calling process. The OpenGL library asks the kernel to allocate the required memory buffers, prepares command buffers and shader code for the GPU, and submits them to the kernel for execution. The kernel collects the jobs submitted by processes, runs them on the GPU, and notifies the processes on completion.
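To make the kernel boundary concrete, here is a rough sketch in C using the generic DRM "dumb buffer" ioctls. Real OpenGL drivers such as Mesa go through driver-specific ioctls instead, and you need permission to open the device node, but the user/kernel split is the same: user space asks for a buffer, the kernel hands back a handle, and user space maps it.

```c
/* Sketch only: allocate a GPU-visible buffer through the Linux DRM
 * "dumb buffer" interface and map it into the process.
 * Build with: cc $(pkg-config --cflags libdrm) dumb.c */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <drm.h>
#include <drm_mode.h>

int main(void)
{
    int fd = open("/dev/dri/card0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Ask the kernel driver to allocate a 640x480, 32 bpp buffer. */
    struct drm_mode_create_dumb create = {
        .width = 640, .height = 480, .bpp = 32
    };
    if (ioctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &create) < 0) {
        perror("DRM_IOCTL_MODE_CREATE_DUMB");
        return 1;
    }

    /* Get an mmap offset; the kernel keeps ownership of the memory,
     * user space only gets a mapping of it. */
    struct drm_mode_map_dumb map = { .handle = create.handle };
    if (ioctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &map) < 0) {
        perror("DRM_IOCTL_MODE_MAP_DUMB");
        return 1;
    }

    void *pixels = mmap(NULL, create.size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, map.offset);
    if (pixels == MAP_FAILED) { perror("mmap"); return 1; }

    memset(pixels, 0xff, create.size);   /* "render" a white frame */
    printf("allocated %llu bytes, pitch %u\n",
           (unsigned long long)create.size, create.pitch);

    munmap(pixels, create.size);
    close(fd);
    return 0;
}
```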
When an application wants to display something on the screen, it shares its buffer with the X server (or equivalent) and instructs it to redraw the window from that buffer. The display server, in turn, uses privileged kernel interfaces (on Linux, DRM master and the modesetting ioctls) to put a buffer on the actual screen.
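For the sharing step, the usual Linux mechanism today is dma-buf: the client exports its GEM handle as a file descriptor and hands it to the display server over a Unix socket, and the server (which holds DRM master) wraps it in a framebuffer and points a CRTC at it. A hedged sketch with libdrm, leaving out error details, mode selection, and the socket plumbing:

```c
/* Sketch of the client/server sharing step using dma-buf and KMS.
 * Build with: cc $(pkg-config --cflags --libs libdrm) share.c */
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm.h>
#include <drm_fourcc.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Client side: export a GEM handle as a shareable dma-buf fd.
 * This fd can be passed to the display server via SCM_RIGHTS. */
int export_buffer(int drm_fd, uint32_t gem_handle)
{
    struct drm_prime_handle prime = {
        .handle = gem_handle,
        .flags  = DRM_CLOEXEC,
    };
    if (ioctl(drm_fd, DRM_IOCTL_PRIME_HANDLE_TO_FD, &prime) < 0)
        return -1;
    return prime.fd;
}

/* Server side: import the fd, wrap it in a framebuffer object and
 * point a CRTC at it.  Only the DRM master (the display server)
 * may change what is actually on screen. */
int scan_out(int drm_fd, int dmabuf_fd, uint32_t width, uint32_t height,
             uint32_t pitch, uint32_t crtc_id, uint32_t connector_id,
             drmModeModeInfo *mode)
{
    uint32_t handle, fb_id;
    if (drmPrimeFDToHandle(drm_fd, dmabuf_fd, &handle) < 0)
        return -1;

    uint32_t handles[4] = { handle };
    uint32_t pitches[4] = { pitch };
    uint32_t offsets[4] = { 0 };
    if (drmModeAddFB2(drm_fd, width, height, DRM_FORMAT_XRGB8888,
                      handles, pitches, offsets, &fb_id, 0) < 0)
        return -1;

    return drmModeSetCrtc(drm_fd, crtc_id, fb_id, 0, 0,
                          &connector_id, 1, mode);
}
```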
Remember the news that L4D2 on Linux outperformed the Windows version? As I understand it, the L4D2 code base uses Direct3D, and on Linux it uses a D3D emulator with OpenGL as the backend, and yet after going through these user-space hoops it still gets a better framerate. The best theory I've heard is that context switching is cheaper on Linux.
"D3D emulator with OpenGL as the backend" sounds an awful lot like Wine. There've been reports of Wine apps running faster than their Windows equivalent too, although usually some wine bug or another gets in the way of performance.
This wouldn't surprise me. Based on the console output, Counter-Strike: Source does the same thing: lots of output about Direct3D that looks exactly like messages from Wine. I suspect they did this to speed up porting.
Usually the user-space overhead is tiny; I can't imagine it would need to be placed in the kernel. Linux keeps the high-level graphics stack in user space and it works just fine.
This (plus some low-level register poking) is the stuff you would need to implement in your GPU daemon. The actual generation of drawing commands, compilation of shaders, etc. is performed by the applications themselves.
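So the split in a microkernel could look something like this. The message layout below is purely hypothetical (none of these names come from an existing system); it just illustrates that the client ships small descriptors and handles to the daemon, never raw pointers or register access:

```c
/* Hypothetical sketch of client <-> GPU-daemon IPC in a microkernel
 * design.  The application builds command buffers and compiled
 * shaders itself, then sends only small descriptors; the daemon is
 * the sole owner of the hardware registers. */
#include <stdint.h>

enum gpu_msg_type {
    GPU_MSG_ALLOC_BUFFER,   /* ask the daemon for GPU-visible memory   */
    GPU_MSG_SUBMIT_JOB,     /* point it at a ready-made command buffer */
    GPU_MSG_PRESENT,        /* ask for this buffer to be scanned out   */
    GPU_MSG_FENCE_SIGNAL,   /* daemon -> client: the job has completed */
};

struct gpu_msg {
    uint32_t type;          /* one of gpu_msg_type                     */
    uint32_t client_id;     /* filled in by the kernel/IPC layer       */
    uint64_t buffer_handle; /* daemon-assigned handle, not a pointer   */
    uint64_t size;          /* bytes requested, or length of the job   */
    uint64_t fence_id;      /* completion token the client waits on    */
};
```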
The overhead should be on the order of a few context switches to the GPU server for each frame an application renders (on Linux it's a few syscalls instead of context switches). A context switch costs on the order of microseconds against a 16.7 ms frame budget at 60 fps, so it's probably not terrible.