Another fascinating bit is that NVIDIA and ATI/AMD have developed what are now the largest general-purpose processors in the world (over 5 billion transistors per chip and counting, available in consumer GPUs for under $300, whereas Intel's largest Xeons top out around 4 billion transistors and cost $2000+), yet they are being held back at the 28nm process because their fab partner (TSMC) is oversubscribed by smaller, higher-demand ARM chips that go into phones.
TSMC plans to begin 16nm FinFET production in early 2015... although they're doing it so they can supply Apple and keep up with Samsung (who also supply Apple and have their own 14nm/16nm plans).
Nvidia's Parker is supposed to use FinFET, and they're a TSMC customer, but Parker will be a 64-bit ARM CPU for servers/mobile devices.
You still have to write your program in a very different way in order to run efficiently on GPUs as opposed to CPUs.
1) Code needs to have at least 1k, better 10k+ parallel 'threads'.
2) These threads should be largely data parallel (branching is possible but hurts performance more significantly than on CPU).
3) Registers and memory per thread are limited; around 30-60 registers and 400-800k of memory are the limits to achieve reasonable saturation. If you disregard this, spilling will occur (or the memory will just run out; there's no swapping, so it will just crash).
4) Because of (1) and (3), GPUs like so-called 'tight loops', i.e. many parallel but smallish kernels (see the sketch below).
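To make (1)-(4) concrete, here's a minimal sketch of the kind of kernel GPUs like: one thread per element, no divergence beyond a bounds check, few registers (a generic saxpy-style example I'm adding for illustration, not something from the article):

    // One thread per element, no branching beyond the bounds check,
    // only a handful of registers per thread.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Launched with tens of thousands of threads to saturate the GPU:
    //   int threads = 256;
    //   int blocks  = (n + threads - 1) / threads;
    //   saxpy<<<blocks, threads>>>(n, 2.0f, d_x, d_y);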
3) Some GPUs have MMUs (and share their paging with the host CPUs)
By MMUs do you mean unified memory? Well, yes, but for now it's so slow that you don't really want to use it. That might change on POWER systems with NVLink and with the Knights Landing generation of Intel accelerators, but that's still in the future / not publicly available.
or on youtube: https://www.youtube.com/playlist?list=PL4A8BA1C3B38CFCA0
See also https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-...
I'm going to read this article anyway to hear their take & for the learning experience, but does anyone remember any of the counter-arg articles?
GPUs for general purpose computation were never 100x faster than CPUs like people claimed in 2008 or so. They're just not. That was basically NV marketing mixed with a lot of people publishing some pretty bad early work on GPUs.
Lots of early papers that fanned GPU hype followed the same basic form: "We have this standard algorithm, we tested it on a single CPU core with minimal optimizations and no SIMD (or maybe some terrible MATLAB code with zero optimization), we tested a heavily optimized GPU version, and look the GPU version is faster! By the way, we didn't port any of those optimizations back to the CPU version or measure PCIe transfer time to/from the GPU." It was utterly trivial to get any paper into a conference by porting anything to the GPU and reporting a speedup. Most of the GPU related papers from this time were awful. I remember one in particular that claimed a 1000x speedup by timing just the amount of time it took for the kernel launch to the GPU instead of the actual kernel runtime, and somehow nobody (either the authors or the reviewers) realized that this was utterly impossible.
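For context on how that particular mistake happens: kernel launches are asynchronous, so a host timer wrapped around the launch measures almost nothing. A rough sketch of how to time it honestly (the kernel, buffers, and launch configuration here are stand-ins, not anything from those papers):

    #include <cuda_runtime.h>

    __global__ void kernel(float *d, int n);   // stand-in for whatever is being benchmarked

    // Wall-clock ms for copy-in + kernel + copy-out. Timing only the
    // kernel<<<...>>> line with a host timer measures launch overhead
    // (microseconds), because the launch returns before the work is done.
    float time_gpu(float *h_data, float *d_data, size_t bytes,
                   int n, int blocks, int threads) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // PCIe in
        kernel<<<blocks, threads>>>(d_data, n);
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // PCIe out
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        return ms;
    }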
GPUs have more FLOPS and more memory bandwidth, in exchange for requiring PCIe transfers and lots of parallel work. If your algorithm needs those more than anything else (like cache), can minimize PCIe transfer time, and handles the whole massive-parallelism thing well, then GPUs are a pretty good bet. If it can't, then they're not going to work particularly well.
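Rough ballpark numbers to make that concrete (era-appropriate figures of my own, not from the article): PCIe 3.0 x16 moves on the order of 16 GB/s, while the GPU's own memory runs at 200-300+ GB/s and the chip can do a few TFLOPS. If the data only makes one trip over the bus and gets touched a handful of times, the transfer dominates and the GPU never gets a chance to win; the payoff shows up when the data stays resident on the card and gets reused heavily.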
(now, if you need to do 2D interpolation and can use the texture fetch hardware on the GPU to do it instead of a bunch of arbitrary math... yeah, that's a _huge_ performance increase because you get that interpolation for free from special-purpose hardware. but that's incredibly rare in practice)
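To illustrate what getting the interpolation "for free" looks like in code, a minimal sketch (the kernel name and setup are mine; it assumes a cudaTextureObject_t bound to the image with cudaFilterModeLinear):

    // The texture unit does the bilinear blend in hardware; the kernel
    // just asks for a fractional coordinate.
    __global__ void resample(cudaTextureObject_t tex, float *out,
                             int w, int h, float scale) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < w && y < h)
            out[y * w + x] = tex2D<float>(tex, x * scale + 0.5f, y * scale + 0.5f);
    }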
I am into audio DSP and am planning to port a couple of audio algorithms (lots of FFT & linear algebra) to run on the GPU, but haven't even gotten to it because I considered it a premature optimization up to this point. I'm sure it would improve performance, but nowhere near what GPU advocates would claim.
My biggest reason?
"PCIe transfer time to/from GPU", plus it would be unoptimized GPU code. Once you read a few of these papers it becomes painfully obvious that a lot of tuning goes into the GPU algorithms that offer anything more than a low single-digit factor of speedup. It's still very significant (cutting a 3 hour algorithm down to 1 would be huge) but if you're in an early stage of research it may be a toss-up over whether its better to just tune the algorithm itself / run computations overnight rather than going through the trouble of writing a GPU-based POC. Maybe if you have 1 or 2 under your belt its not such a big deal but for most of the researchers I know GPU algorithm rewrites would not be trivial. (I've been doing enterprise Java coding for about 2 years now so the idea isn't so intimidating now, but in a past life of mucking around with Matlab scripts I'm sure it would have been daunting).
I wrote a paper on this in one particular domain (computational chemistry), more or less as a rebuttal to a paper that claimed enormous GPU speedups; the speedups were a consequence of slow CPU code, not especially fast GPU code.
Anyway, the Cortex-A15 is capable of 8 flops per cycle per core, which puts it in pretty good shape for theoretical efficiency given its likely power draw at current clocks.
And I never managed to get close to 8 flops per cycle on the A15, but, for example, an FFT implementation on the VC4 gets pretty close to its theoretical performance limit. And a fully loaded 4-core A15 will draw far more than 500mW anyway.
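For scale (my arithmetic, assuming a nominal ~1.5 GHz clock): 4 cores x 8 flops/cycle x 1.5 GHz ≈ 48 GFLOPS theoretical peak single precision, with sustained rates well below that in practice.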
I read the article now; cool technical overview. But basically all of these processor-architecture articles have a slant in the paragraphs where they wax poetic (abstract, analysis/conclusion). I think it would be helpful for people to be aware of this...
AFAIK Nvidia (their company name is at the top of this paper, btw, in case you weren't paying attention) is trying to generalize its chips to the point where it can enter the CPU market, and Intel chips can render 3D graphics well enough to handle most games that are ~5 years old (since Haswell, or maybe one or two generations before).
So this isn't a particularly slanted article, but there is a fair amount of propaganda / contrived performance studies in this market... NVIDIA and Intel are vying for each other's core customer bases. Anyone interested in the field should dig up the articles that try to debunk performance myths as well as studying architecture overviews.
(Some of the sentences in the last few paragraphs, for example, made me sorta queasy & would get shredded on Wikipedia.)
(I know there are VGA reimplementations available, and the VGA is quite well-documented, but that's more of a timing controller/dumb frame-buffer than a real GPU.)
Standalone code running on a Radeon HD2400 that's _not plugged into anything_
Author's blog: http://www.pixel.io/blog/
He never released any source; actually, he had something posted to GitHub, but made the repo private.
And some of the GPU vendors are publishing their datasheets specifically in the hope that an alternative open-source driver stack will appear.
I'll admit I only know the basics of GPU architecture, so please forgive/correct me if I'm wrong about something. However, I am just too curious not to share.
I'll try to explain. A frame buffer is nothing but a bunch of 1s and 0s in memory, while a monitor is just a bunch of pixels encoding 1s and 0s. We currently have the GPU write to memory in parallel, yet we write pixels to a monitor serially (hence interlacing). Given the similarity between memory and pixels, why can't we design a GPU to write to the pixels in parallel instead of to memory? Taken to the extreme, you could have one shader per pixel, and since the shaders all run on the same clock cycle, the whole monitor would update simultaneously. I think that would be really cool and, more importantly, efficient. In more practical terms you would probably make one shader responsible for some group of pixels (so you only need one shader per 4x3 or 16x9 block of pixels).
So, before you say it, I get that you might disagree with me when it comes to desktop GPUs, since 1. the GPU memory needs to be close to RAM (you don't want GPU memory on the other side of a "long" cable) and 2. you'd like to be able to update the GPU hardware separately from your monitor. But in something like mobile/Oculus, the form factor is already so small and tightly coupled that I'm surprised optimizations like this aren't being looked into.
Am I just not up to date? Is there something fundamentally wrong with my logic? Does getting rid of the frame buffer/interlacing not provide enough of a boost to make this worthwhile?
Timing is another huge one. Imagine running 2 million wires (for a 1080p display) that all have to be the exact same length to within some tolerance.
The longer those wires get, the harder this gets. This is also a big reason why the move to serial buses happened: you can run 4 wires with really tight timings and the bits will fly, but if you try to run 16 wires together, speed ends up dropping dramatically. The reality is that circuit boards don't have room for a large number of parallel traces of exactly the same length!
RAM is a huge exception to this, but extreme measures have been taken to make it happen: a good chunk of your motherboard is taken up getting the RAM connected, and memory controllers moved onto the CPU in part to get the RAM closer to the CPU and simplify the traces.
Note this is all the perspective of a software guy who has to listen to the hardware team grumble for most of the day. :)
Then someone finds that they can add a bit of processing to that display to make it go just a bit faster...
Along similar lines, consider that CPUs have billions of transistors but only a little over a thousand pins.
The last 20 pages are on GPUs.
If you're interested in the architecture of a GPU this Berkeley ParLab presentation by Andy Glew from 2009 covers the basics of how the compute cores in modern GPUs handle threading. It's a subtle, but powerful, difference from SIMD or vector machines.
If you want to get into the details of how a GPU interfaces with the system and OS software, which is almost an entirely other animal, you may want to look at the Nouveau project to get oriented.
"Description: Could not connect to the requested server host. "
Is there any other link for the paper?