
It's likely GPUs are slower than CPUs for spatial data structures. Getting the data to the GPU and the results back just takes too long. Point-in-polygon is also very branchy in the general case, and GPUs are really bad with branchy code, so it's very unlikely you could even GPU-accelerate such a query if it operates on point coordinates and a set of polygon vertices. At least it would be very hard.

Edit: right, after thinking about it, the branches can be optimized out. It could be fast given a set of sorted segments: just parallel compares and some boolean logic.
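
Something like this is what I have in mind: the classic even-odd crossing test boils down to a compare, a multiply/divide, and an XOR per edge, with no data-dependent branches. A rough sketch in CUDA-style C++ (my own names, untuned):

    // Even-odd (crossing number) point-in-polygon test: compares plus boolean
    // logic per edge, no data-dependent branching on the polygon data.
    // __host__ __device__ so the same code runs per-thread on a GPU or in a
    // plain CPU loop.
    __host__ __device__ bool pointInPolygon(float px, float py,
                                            const float* vx, const float* vy,
                                            int n)
    {
        bool inside = false;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            // Does edge (j -> i) straddle the horizontal line y = py?
            bool straddles = (vy[i] > py) != (vy[j] > py);
            // x coordinate where the edge crosses that line; only consumed when
            // straddles is true, so a horizontal edge's divide-by-zero is harmless.
            float xCross = vx[i] + (py - vy[i]) * (vx[j] - vx[i]) / (vy[j] - vy[i]);
            // Bitwise & keeps it branch-free: toggle "inside" on every crossing
            // strictly to the right of the point.
            inside ^= (straddles & (px < xCross));
        }
        return inside;
    }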

Which leaves the problem of getting the data to the GPU, because you can definitely stream the same comparisons on the CPU much faster (memory-bandwidth limited) than you can stream the data to the GPU over PCIe.

So for a two-socket system, such as a Xeon E5, I'd bet on the CPU. PCIe 4.0 with 16 lanes would give about 30 GB/s (not sure PCIe 4.0 is supported anywhere yet), vs. aggregate CPU memory bandwidth of up to 150-200 GB/s. A dual-socket Xeon E5 supports at least 1 TB of RAM; a typical board with 16 slots of 16 GB registered DDR4 already gives 256 GB, 32 GB DDR4 modules exist as well, I think, and more than 16 DIMM slots can be supported. Sixteen is just the number of slots on typical two-socket mainboards.

In a more realistic setting, the CPU would be even further ahead. All of this ignores GPU latency issues, which can be anywhere from microseconds to tens of milliseconds in pathological cases.

Unless the data was on the GPU in the first place... I think a single GPU can currently have up to 12 GB of RAM; maybe larger GPUs exist too. That's just not much compared to what is typical for CPUs. Currently the smallest amount of RAM a standard dual-socket Xeon E5 v3 server can have is 64 GB, if all memory channels have at least one DIMM.




What's important to keep in mind is the progress AMD has been making with its Heterogeneous System Architecture. With high-frequency RAM (especially once DDR4 becomes widely available) shared between a CPU and a GPU on the same die, all of this communication overhead goes away.

There is limited software support right now, because the architecture is very new, but on benchmarks that take advantage of the on-die GPU, AMD's latest can keep up with and surpass much more expensive i7s. It's still unclear whether AMD's HSA will take a commanding lead, but it's promising, especially considering the price and power requirements of an A10 (~$160 currently) vs. a high-end GPU plus a Xeon.

Ignoring storage and peripherals (reasonable in a server-farm arrangement), you could put together many more iGPU boxes than Xeon/dGPU boxes for the same budget.

Even assuming only moderate gains from GPU acceleration, high-throughput database servers could be made cheaper this way.


Point-in-polygon actually seems like a good problem for the GPU. It can be calculated with a simple angle calculation performed in parallel against all segments. Trig functions are fairly heavyweight, and running them serially, or with a lesser degree of threading on the CPU, adds up to a lot of computation, which makes it a good candidate for offloading to the GPU.


Point-in-polygon is usually done with simple vector arithmetic, not trig.
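
For example, the usual winding-number test needs nothing but multiplies, subtractions, and sign checks per edge, and it handles concave polygons too. A rough sketch (CUDA-style C++, names my own):

    // >0 if point (cx, cy) is left of the directed line (a -> b): a 2D cross
    // product, no trig anywhere.
    __host__ __device__ inline float cross2(float ax, float ay, float bx, float by,
                                            float cx, float cy)
    {
        return (bx - ax) * (cy - ay) - (cx - ax) * (by - ay);
    }

    // Winding-number point-in-polygon test; a nonzero result means "inside".
    // Works for concave (non-convex) simple polygons as well.
    __host__ __device__ int windingNumber(float px, float py,
                                          const float* vx, const float* vy, int n)
    {
        int wn = 0;
        for (int i = 0; i < n; i++) {
            int j = (i + 1) % n;  // next vertex, wrapping around
            if (vy[i] <= py) {
                // upward crossing with the point to the left of the edge
                if (vy[j] > py && cross2(vx[i], vy[i], vx[j], vy[j], px, py) > 0) wn++;
            } else {
                // downward crossing with the point to the right of the edge
                if (vy[j] <= py && cross2(vx[i], vy[i], vx[j], vy[j], px, py) < 0) wn--;
            }
        }
        return wn;
    }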


Surely a non-iterative angle-calculation method would only work for convex polygons?


> vs. aggregate CPU memory bandwidth of up to 150-200 GB/s

For streaming data into a CPU you'll be lucky to get double-digit bandwidths. Peak figures are ~50 GB/s per socket, but for anything more than a memcpy it drops off a cliff. Then you also have NUMA issues, bank conflicts, and TLB misses if your data is big enough...

I've written code that sustains >270 GB/s on high-end GPUs. It's not trivial, but it can be done.

You are correct, though, about the quantity of memory available on an average GPU. The AMD S9150 has 16 GB of RAM, which is very high for a GPU but nothing compared to high-end servers.

> PCIe 4.0 with 16 lanes would give about 30 GB/s

AFAIK it's not in anything yet, so we're limited to ~6 GB/s for GPU <-> host... :/

> In a more realistic setting, the CPU would be even further ahead.

Depends. Getting a good fraction of peak bandwidth on a GPU is fairly straightforward: coalesce accesses. Some algorithms need to be "massaged" into performing reads/writes like this, but in my experience a large portion of them can be.
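
As a toy illustration of what I mean by coalescing (a sketch, not production code): in the first kernel, consecutive threads read consecutive elements, so each warp's loads merge into a few wide transactions; in the second, a stride scatters the accesses and effective bandwidth drops off.

    // Coalesced: thread i reads element i, so a warp's loads collapse into a
    // few wide memory transactions.
    __global__ void scaleCoalesced(const float* in, float* out, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = s * in[i];
    }

    // Strided: neighbouring threads touch addresses `stride` elements apart,
    // so each warp's loads spread over many transactions and bandwidth drops.
    __global__ void scaleStrided(const float* in, float* out, int n, int stride, float s)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = s * in[i];
    }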

Getting a decent fraction of peak on a CPU is a totally different ballgame, however.

IMO, if the data can persist on the GPU, then this could be a big win.


Well, don't set NUMA to interleave! Instead, lay out all of the first socket's memory first, then all of the second socket's memory, and so on. Use 2 MB or 1 GB pages (you don't want a TLB miss every 4 kB!). DRAM-wise, prefetch for each memory channel to cover DRAM-internal penalties. I think DRAM bank-switch penalties span 256 bytes, assuming 4 memory channels, every 4, 8, or 16 kB. Things are variable, which is what makes it hard and annoying. Don't overload a single memory channel; the worst case, channel-wise, is reading 64 bytes aligned and skipping the next 192 bytes, again assuming 4x 64-bit memory channels per CPU socket. Correct me if I'm wrong, but I think a single memory channel fills a single 64-byte cache line.

And no matter what you do, don't write to the same cache lines, especially across NUMA regions. Also avoid locks and even atomic operations. Try to ensure that PCIe DMA happens in the local NUMA region as well.
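
For example, something like this is what I mean by keeping a scan buffer node-local and on big pages (a sketch using libnuma plus a transparent-huge-page hint; the helper name is made up, and MADV_HUGEPAGE is only a hint to the kernel):

    #include <stddef.h>      // size_t, NULL
    #include <numa.h>        // libnuma; link with -lnuma
    #include <sys/mman.h>    // madvise, MADV_HUGEPAGE

    // Allocate a scan buffer whose pages are bound to one NUMA node, and ask
    // for 2 MB transparent huge pages so we don't take a TLB miss every 4 kB.
    // The thread that streams through it should run on the same node.
    static void* allocNodeLocalBuffer(size_t bytes, int node)
    {
        void* p = numa_alloc_onnode(bytes, node);  // pages bound to `node`
        if (p != NULL)
            madvise(p, bytes, MADV_HUGEPAGE);      // hint: back with huge pages
        return p;                                  // release with numa_free(p, bytes)
    }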

I'm impressed by getting 100 GB/s of CPU bandwidth. It's hard to avoid QPI saturation.


If the data can stay on the GPU, then it is likely a win. GPUs have 8 GB or more now, so it depends on how much polygon data one has.


Not necessarily: even if the data is on the GPU and doesn't have to pay the PCIe transfer penalty, GPUs still have cache hierarchies, and those have latencies too; they can be worse than on CPUs because the branch predictors and prefetchers of GPUs are still fairly primitive compared to what CPUs are capable of. That means access patterns on a GPU can actually matter quite a bit: you end up having to change block size per GPU type, and code that works very well on one GPU doesn't work as well on another.
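
For what it's worth, the per-GPU tuning I mean tends to look something like this: derive the launch configuration from the device you actually got instead of hard-coding it. A sketch with a made-up heuristic:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Made-up heuristic: pick the block size per device generation rather
        // than baking one number into the code for every GPU.
        const int n = 1 << 20;
        int blockSize = (prop.major >= 3) ? 256 : 128;
        int gridSize  = (n + blockSize - 1) / blockSize;

        float* d = NULL;
        cudaMalloc((void**)&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        touch<<<gridSize, blockSize>>>(d, n);
        cudaDeviceSynchronize();

        printf("%s: %d SMs, using block size %d\n",
               prop.name, prop.multiProcessorCount, blockSize);
        cudaFree(d);
        return 0;
    }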

However, point-in-polygon is a fairly simple algorithm, and if each polygon mostly has fewer than ~40 vertices, I suspect a GPU might be faster. For more complex algorithms, though, GPUs don't do as well, and with many more vertices per polygon I suspect GPUs won't do as well at point-in-polygon tests either.

In terms of raw theoretical FP processing power, GPUs look good, but when you start to do more complex things with them, i.e. when branching happens a lot, say with path tracing, they don't look as good. E.g. a dual-socket machine with 3.5 GHz quad-core Xeons (~£950 per CPU) is as fast at path tracing as a single NVidia K6000 costing ~£4100.


Here is a production-quality path tracer that, for most users, runs noticeably faster than competing CPU-based renderers: https://www.redshift3d.com/

Pragmatically, it produces results of similar quality quicker than CPU-based competitors.

It is really taking the high end rendering world by storm this year.


Erm??...

That's a biased renderer that uses all sorts of caching and approximations that no CPU-based renderer supports (VRay is the closest, with its ability to configure primary and secondary rays to use different irradiance-caching methods), and as Redshift doesn't support CPU rendering, it's hardly a comparison worth talking about: you'd be comparing different algorithms. The pure brute-force numbers I've seen for it, without any caching, don't look any better than the other top CPU renderers doing brute-force MC integration.

Also, a quibble, but I guess by "high-end rendering world" you mean archviz (where VRay and 3DSMax are dominant) and a few small VFX studios who happen to be running Windows?


You're selling it a little short: pretty much everyone not doing feature films, i.e. game cinematics, commercials, and product viz.


"Pretty much everyone"

Really?

I know companies like Blur are trialling it, but they're still using VRay. I know The Mill have done stuff with it, but they're still using Arnold too.


I misread; I thought you were saying not many studios are using Windows and VRay. Agreed on Redshift.


> Getting the data to the GPU and the results back just takes too long

I was involved in a database research project recently, and this is exactly what we found: sure, GPGPUs and the like are much faster than CPUs for the right database queries, but the transfer overhead is so absolutely horrendous that it completely dwarfs any gains in execution time.
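
The effect is easy to see if you time the PCIe copy and the kernel separately. A rough sketch of that kind of measurement (names and sizes are illustrative, not our actual code):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t n = 1 << 26;                  // ~256 MB of floats
        float *host = NULL, *dev = NULL;
        cudaMallocHost((void**)&host, n * sizeof(float));  // pinned memory for fast DMA
        cudaMalloc((void**)&dev, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time just the host-to-device transfer over PCIe.
        cudaEventRecord(start);
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D copy: %.1f ms (%.1f GB/s)\n",
               ms, (n * sizeof(float)) / (ms * 1e6));
        // Time the query kernel the same way; in our experience the copy above
        // was routinely the bigger of the two numbers.

        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }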


Did you throw AMD APUs into the project, or does their lack of x86_64 per-core performance kill the advantage of the shared memory pool?


No AMD stuff. We just had a box with a couple of Xeons, a Xeon Phi, and a K20.



