
It's likely GPUs are slower than CPUs for spatial data structures. Getting the data to the GPU and the results back just takes too long. Point-in-polygon is also very branchy in the general case, and GPUs are really bad with branchy code, so it's very unlikely you could even GPU-accelerate such a query if it operates on point coordinates and a set of polygon vertices. At least it would be very hard.

Edit: right, after thinking about it, the branches can be optimized out. It could be fast given a set of sorted segments: just parallel compares and some boolean logic.
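
Something like this is what I have in mind: the classic even-odd crossing test boils down to a compare, a multiply/divide, and an XOR per edge, with no data-dependent branches. A rough sketch in CUDA-style C++ (my own names, untuned):

    // Even-odd (crossing number) point-in-polygon test: compares plus boolean
    // logic per edge, no data-dependent branching on the polygon data.
    // __host__ __device__ so the same code runs per-thread on a GPU or in a
    // plain CPU loop.
    __host__ __device__ bool pointInPolygon(float px, float py,
                                            const float* vx, const float* vy,
                                            int n)
    {
        bool inside = false;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            // Does edge (j -> i) straddle the horizontal line y = py?
            bool straddles = (vy[i] > py) != (vy[j] > py);
            // x coordinate where the edge crosses that line; only consumed when
            // straddles is true, so a horizontal edge's divide-by-zero is harmless.
            float xCross = vx[i] + (py - vy[i]) * (vx[j] - vx[i]) / (vy[j] - vy[i]);
            // Bitwise & keeps it branch-free: toggle "inside" on every crossing
            // strictly to the right of the point.
            inside ^= (straddles & (px < xCross));
        }
        return inside;
    }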

Which leaves the problem of getting the data to the GPU, because you can definitely stream the same comparisons on the CPU much faster (memory-bandwidth limited) than you can stream the data to the GPU over PCIe.

So for a two-socket system, such as a Xeon E5, I'd bet on the CPU. PCIe 4.0 with 16 lanes would give about 30 GB/s (not sure PCIe 4.0 is supported anywhere yet), vs. aggregate CPU memory bandwidth of up to 150-200 GB/s. A dual-socket Xeon E5 supports at least 1 TB of RAM; a typical board with 16 slots of 16 GB registered DDR4 already gives 256 GB, 32 GB DDR4 modules exist as well, I think, and more than 16 DIMM slots can be supported. Sixteen is just the number of slots on typical two-socket mainboards.

In a more realistic setting, the CPU would be even further ahead. All of this ignores GPU latency issues, which can be anywhere from microseconds to tens of milliseconds in pathological cases.

Unless the data was on the GPU in the first place... I think a single GPU can currently have up to 12 GB of RAM; maybe larger GPUs exist too. That's just not much compared to what is typical for CPUs. Currently the smallest amount of RAM a standard dual-socket Xeon E5 v3 server can have is 64 GB, if all memory channels have at least one DIMM.




What's important to keep in mind is the progress AMD has been making with its Heterogeneous System Architecture. With high-frequency RAM (especially once DDR4 becomes widely available) shared between a CPU and a GPU on the same die, all of this communication overhead goes away.

There is limited software support right now, because the architecture is very new, but on benchmarks that take advantage of the on-die GPU, AMD's latest can keep up with and surpass much more expensive i7s. It's still unclear whether AMD's HSA will take a commanding lead, but it's promising, especially considering the price and power requirements of an A10 (~$160 currently) vs. a high-end GPU plus a Xeon.

Ignoring storage and peripherals (reasonable in a server-farm arrangement), you could put together many more iGPU boxes than Xeon/dGPU boxes for the same budget.

Even assuming only moderate gains from GPU acceleration, high-throughput database servers could be made cheaper this way.


Point-in-polygon actually seems like a good problem for the GPU. It can be calculated with a simple angle calculation performed in parallel against all segments. Trig functions are fairly heavyweight, and running them serially, or with a lesser degree of threading on the CPU, adds up to a lot of computation, which makes it a good candidate for offloading to the GPU.


Point-in-polygon is usually done with simple vector arithmetic, not trig.
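
For example, the usual winding-number test needs nothing but multiplies, subtractions, and sign checks per edge, and it handles concave polygons too. A rough sketch (CUDA-style C++, names my own):

    // >0 if point (cx, cy) is left of the directed line (a -> b): a 2D cross
    // product, no trig anywhere.
    __host__ __device__ inline float cross2(float ax, float ay, float bx, float by,
                                            float cx, float cy)
    {
        return (bx - ax) * (cy - ay) - (cx - ax) * (by - ay);
    }

    // Winding-number point-in-polygon test; a nonzero result means "inside".
    // Works for concave (non-convex) simple polygons as well.
    __host__ __device__ int windingNumber(float px, float py,
                                          const float* vx, const float* vy, int n)
    {
        int wn = 0;
        for (int i = 0; i < n; i++) {
            int j = (i + 1) % n;  // next vertex, wrapping around
            if (vy[i] <= py) {
                // upward crossing with the point to the left of the edge
                if (vy[j] > py && cross2(vx[i], vy[i], vx[j], vy[j], px, py) > 0) wn++;
            } else {
                // downward crossing with the point to the right of the edge
                if (vy[j] <= py && cross2(vx[i], vy[i], vx[j], vy[j], px, py) < 0) wn--;
            }
        }
        return wn;
    }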


Surely a non-iterative angle-calculation method would only work for convex polygons?


> vs. aggregate CPU memory bandwidth of up to 150-200 GB/s

For streaming data into a CPU you'll be lucky to get double-digit bandwidths. Peak figures are ~50 GB/s per socket, but for anything more than a memcpy it drops off a cliff. Then you also have NUMA issues, bank conflicts, and TLB misses if your data is big enough...

I've written code that sustains >270 GB/s on high-end GPUs. It's not trivial, but it can be done.

You are correct, though, about the quantity of memory available on an average GPU. The AMD S9150 has 16 GB of RAM, which is very high for a GPU but nothing compared to high-end servers.

> PCIe 4.0 with 16 lanes would give about 30 GB/s

AFAIK it's not in anything yet, so we're limited to ~6 GB/s for GPU <-> host... :/

> In a more realistic setting, the CPU would be even further ahead.

Depends. Getting a good fraction of peak bandwidth on a GPU is fairly straightforward: coalesce accesses. Some algorithms need to be "massaged" into performing reads/writes like this, but in my experience a large portion of them can be.
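
As a toy illustration of what I mean by coalescing (a sketch, not production code): in the first kernel, consecutive threads read consecutive elements, so each warp's loads merge into a few wide transactions; in the second, a stride scatters the accesses and effective bandwidth drops off.

    // Coalesced: thread i reads element i, so a warp's loads collapse into a
    // few wide memory transactions.
    __global__ void scaleCoalesced(const float* in, float* out, int n, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = s * in[i];
    }

    // Strided: neighbouring threads touch addresses `stride` elements apart,
    // so each warp's loads spread over many transactions and bandwidth drops.
    __global__ void scaleStrided(const float* in, float* out, int n, int stride, float s)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = s * in[i];
    }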

Getting a decent fraction of peak on a CPU is a totally different ballgame, however.

IMO, if the data can persist on the GPU, then this could be a big win.


Well, don't set NUMA to interleave! Instead, lay out all of the first socket's memory first, then all of the second socket's memory, and so on. Use 2 MB or 1 GB pages (you don't want a TLB miss every 4 kB!). DRAM-wise, prefetch for each memory channel to cover DRAM-internal penalties. I think DRAM bank-switch penalties span 256 bytes, assuming 4 memory channels, every 4, 8, or 16 kB. Things are variable, which is what makes it hard and annoying. Don't overload a single memory channel; the worst case, channel-wise, is reading 64 bytes aligned and skipping the next 192 bytes, again assuming 4x 64-bit memory channels per CPU socket. Correct me if I'm wrong, but I think a single memory channel fills a single 64-byte cache line.

And no matter what you do, don't write to the same cache lines, especially across NUMA regions. Also avoid locks and even atomic operations. Try to ensure that PCIe DMA happens in the local NUMA region as well.
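
For example, something like this is what I mean by keeping a scan buffer node-local and on big pages (a sketch using libnuma plus a transparent-huge-page hint; the helper name is made up, and MADV_HUGEPAGE is only a hint to the kernel):

    #include <stddef.h>      // size_t, NULL
    #include <numa.h>        // libnuma; link with -lnuma
    #include <sys/mman.h>    // madvise, MADV_HUGEPAGE

    // Allocate a scan buffer whose pages are bound to one NUMA node, and ask
    // for 2 MB transparent huge pages so we don't take a TLB miss every 4 kB.
    // The thread that streams through it should run on the same node.
    static void* allocNodeLocalBuffer(size_t bytes, int node)
    {
        void* p = numa_alloc_onnode(bytes, node);  // pages bound to `node`
        if (p != NULL)
            madvise(p, bytes, MADV_HUGEPAGE);      // hint: back with huge pages
        return p;                                  // release with numa_free(p, bytes)
    }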

I'm impressed by getting 100 GB/s of CPU bandwidth. It's hard to avoid QPI saturation.


If the data can stay on the GPU, then it is likely a win. GPUs have 8 GB or more now, so it depends on how much polygon data one has.


Not necessarily: even if the data is on the GPU and doesn't have to pay the PCIe transfer penalty, GPUs still have cache hierarchies, and those have latencies too; they can be worse than on CPUs because the branch predictors and prefetchers of GPUs are still fairly primitive compared to what CPUs are capable of. That means access patterns on a GPU can actually matter quite a bit: you end up having to change block size per GPU type, and code that works very well on one GPU doesn't work as well on another.
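
For what it's worth, the per-GPU tuning I mean tends to look something like this: derive the launch configuration from the device you actually got instead of hard-coding it. A sketch with a made-up heuristic:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Made-up heuristic: pick the block size per device generation rather
        // than baking one number into the code for every GPU.
        const int n = 1 << 20;
        int blockSize = (prop.major >= 3) ? 256 : 128;
        int gridSize  = (n + blockSize - 1) / blockSize;

        float* d = NULL;
        cudaMalloc((void**)&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        touch<<<gridSize, blockSize>>>(d, n);
        cudaDeviceSynchronize();

        printf("%s: %d SMs, using block size %d\n",
               prop.name, prop.multiProcessorCount, blockSize);
        cudaFree(d);
        return 0;
    }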

However, point-in-polygon is a fairly simple algorithm, and if each polygon mostly has fewer than ~40 vertices, I suspect a GPU might be faster. For more complex algorithms, though, GPUs don't do as well, and with many more vertices per polygon I suspect GPUs won't do as well at point-in-polygon tests either.

In terms of raw theoretical FP processing power, GPUs look good, but when you start to do more complex things with them, i.e. when branching happens a lot, say with path tracing, they don't look as good. E.g. a dual-socket machine with 3.5 GHz quad-core Xeons (~£950 per CPU) is as fast at path tracing as a single NVidia K6000 costing ~£4100.


Here is a production-quality path tracer that, for most users, runs noticeably faster than competing CPU-based renderers: https://www.redshift3d.com/

Pragmatically, it produces results of similar quality quicker than CPU-based competitors.

It is really taking the high end rendering world by storm this year.


Erm??...

That's a biased renderer that uses all sorts of caching and approximations that no CPU-based renderer supports (VRay is the closest, with its ability to configure primary and secondary rays to use different irradiance-caching methods), and as Redshift doesn't support CPU rendering, it's hardly a comparison worth talking about: you'd be comparing different algorithms. The pure brute-force numbers I've seen for it, without any caching, don't look any better than the other top CPU renderers doing brute-force MC integration.

Also, a quibble, but I guess by "high-end rendering world" you mean archviz (where VRay and 3DSMax are dominant) and a few small VFX studios who happen to be running Windows?


You're selling it a little short: pretty much everyone not doing feature films, i.e. game cinematics, commercials, and product viz.


"Pretty much everyone"

Really?

I know companies like Blur are trialling it, but they're still using VRay. I know The Mill have done stuff with it, but they're still using Arnold too.


I misread; I thought you were saying not many studios are using Windows and VRay. Agreed on Redshift.


> Getting the data to the GPU and the results back just takes too long

I was involved in a database research project recently, and this is exactly what we found: sure, GPGPUs and the like are much faster than CPUs for the right database queries, but the transfer overhead is so absolutely horrendous that it completely dwarfs any gains in execution time.
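
The effect is easy to see if you time the PCIe copy and the kernel separately. A rough sketch of that kind of measurement (names and sizes are illustrative, not our actual code):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const size_t n = 1 << 26;                  // ~256 MB of floats
        float *host = NULL, *dev = NULL;
        cudaMallocHost((void**)&host, n * sizeof(float));  // pinned memory for fast DMA
        cudaMalloc((void**)&dev, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time just the host-to-device transfer over PCIe.
        cudaEventRecord(start);
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("H2D copy: %.1f ms (%.1f GB/s)\n",
               ms, (n * sizeof(float)) / (ms * 1e6));
        // Time the query kernel the same way; in our experience the copy above
        // was routinely the bigger of the two numbers.

        cudaFree(dev);
        cudaFreeHost(host);
        return 0;
    }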


Did you throw AMD APUs into the project, or does their lack of x86_64 per-core performance kill the advantage of the shared memory pool?


No AMD stuff. We just had a box with a couple of Xeons, a Xeon Phi, and a K20.



