
BlazingDB uses GPUs to manipulate huge databases in no time - rezist808
https://techcrunch.com/2016/09/12/blazingdb-uses-gpus-to-manipulate-huge-databases-in-no-time/
======
jandrewrogers
Several companies have implemented databases on GPUs but there is a good
technical reason that the approach has never really caught on, and some of
these companies even migrated to selling the same platform on CPUs only.

The weakness of GPU databases is that while they have fantastic internal
bandwidth, their link to the rest of the hardware in a server system is
over PCIe, which generally isn't as fast as the CPU's path to main memory,
and databases tend to be bandwidth bound. This is a real bottleneck, and
trying to work around it makes the entire software stack clunky.
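
To put rough numbers on it (mine, not a benchmark; assume PCIe 3.0 x16 at
~16 GB/s and ~68 GB/s of quad-channel DDR4 bandwidth per socket):

    #include <cstdio>

    int main() {
        const double working_set_gb = 1024.0;  // a 1 TB table scan
        const double pcie_gbps      = 16.0;    // assumed PCIe 3.0 x16
        const double dram_gbps      = 68.0;    // assumed quad-channel DDR4
        printf("over PCIe: %4.0f s\n", working_set_gb / pcie_gbps);  // ~64 s
        printf("from DRAM: %4.0f s\n", working_set_gb / dram_gbps);  // ~15 s
        return 0;
    }

The GPU can crunch data faster once it arrives, but for scan-heavy work the
arrival rate is the whole game.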

I once asked the designer of a good GPU database what the "no bullshit"
performance numbers were relative to CPU. He told me GPU was about 4x the
throughput of CPU, which is very good, but after you added in the extra
hardware costs, power costs, engineering complexity etc, the overall economics
were about the same to a first approximation. And so he advised me to not
waste my time considering GPU architectures for databases outside of some
exotic, narrow use cases where the economics still made sense. Which was
completely sensible.

For data intensive processing, like databases, you don't want your engine to
live on a coprocessor. No matter how attractive the coprocessor, the
connectivity to the rest of the system extracts a big enough price that it is
rarely worth it.

~~~
tw04
It would seem that rationale would fall apart quickly with the new POWER
CPUs that have NVLink built right into the CPU. Getting data back and forth
shouldn't be a problem anymore.

Outside of CPU <-> GPU, I'm not sure what other data movement you could be
talking about. A SAS HBA, Ethernet NIC, or InfiniBand HCA is almost always
going to be operating over the same PCIe bus the GPU uses. In the rare
instances they're built onto the CPU, the "network link" is likely still
going to be significantly slower than the fastest PCIe slot.

~~~
marcell
What is NVLink, and what does it mean in terms of data transfer?

~~~
fulafel
It's a faster link between future IBM POWER processors and NVIDIA
GPUs, but it won't make waves in the database market since those systems are
niche HPC/supercomputing hardware.

~~~
jzelinskie
I've also seen NVLink on some of the pre-Pascal roadmaps for NVIDIA's
gaming-oriented graphics cards. Since the current generation of gaming
consoles has HSA, I'm hoping that it gains in popularity and becomes less
niche. The problem is that it'd be a pretty vital component to leave
NVIDIA-proprietary.

~~~
snuxoll
CAPI is an "open" solution to do the same thing, though NVLink supposedly
still offers more bandwidth. Unfortunately CAPI is only available on POWER8
hardware at the moment and I'm not sure if IBM is open to licensing it beyond
OpenPOWER - still, there's a decent number of CAPI-capable cards already
available and more coming to the market in the future.

GPUs may not be great for every unit of work an RDBMS has to perform, but
given their ability to rapidly compute hashes they could help a lot with
joins (as evidenced by PGStrom).
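
For illustration, a minimal CUDA sketch of the bulk hashing on the build side
of a hash join (the mixer is MurmurHash3's 64-bit finalizer; everything else
here is made up for the example):

    #include <cstdint>

    // MurmurHash3 64-bit finalizer: a cheap, well-mixed integer hash.
    __device__ uint64_t mix64(uint64_t k) {
        k ^= k >> 33;
        k *= 0xff51afd7ed558ccdULL;
        k ^= k >> 33;
        k *= 0xc4ceb9fe1a85ec53ULL;
        k ^= k >> 33;
        return k;
    }

    // One thread per row: hash every join key in the column at once.
    __global__ void hash_keys(const uint64_t* keys, uint64_t* hashes, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            hashes[i] = mix64(keys[i]);
    }

    // Launch (host side):
    //   hash_keys<<<(n + 255) / 256, 256>>>(d_keys, d_hashes, n);

It's embarrassingly parallel and branch-free, which is exactly the shape of
work GPUs are good at.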

------
hendzen
This has been tried many times. Unless you are doing a computation with
sufficient arithmetic intensity [0] the cost of shipping the data over PCIe
and back dominates any gain you might get over a CPU.

[0] - [http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/](http://www.nersc.gov/users/application-performance/measuring-arithmetic-intensity/)
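
To make "sufficient arithmetic intensity" concrete with toy numbers (both
assumed, substitute your own):

    #include <cstdio>

    int main() {
        // Below this ratio of useful work to bytes shipped, the PCIe copy
        // dominates and the GPU can't win no matter how fast it computes.
        const double pcie_bytes_per_sec = 16e9;   // assumed PCIe 3.0 x16
        const double gpu_flops_per_sec  = 10e12;  // assumed sustained GPU rate
        printf("break-even: %.0f flops/byte\n",
               gpu_flops_per_sec / pcie_bytes_per_sec);  // ~625
        return 0;
    }

A table scan with a filter does on the order of a few ops per byte, nowhere
near that threshold.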

~~~
sitkack
You would need a GPU with an onboard SSD (exists). Or GPUs and drivers with
the ability to talk directly to InfiniBand hardware (exists).

Or the ability to not be ridiculously wasteful of existing resources (also
exists).

Your naysaying doesn't make you smart. Your naysaying makes you cut off from
learning a different, better way of doing things.

~~~
hendzen
Database workloads tend to be very branch heavy, e.g. string comparisons. If
you've ever programmed a GPU you know that taking a data-dependent branch
serializes the execution of the GPU stream processing units. So already the
70-100x benefit GPUs give on vectorized floating point workloads is greatly
reduced. Add in the memory bandwidth penalty and it's a complete waste for IO
intensive database workloads.
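
A contrived CUDA sketch of the divergence problem (names made up):

    __device__ int work_even(int v) { return v * 3 + 1; }  // stand-in workloads
    __device__ int work_odd(int v)  { return v * 2 - 7; }

    // Threads in a warp execute in lockstep: when vals[i] differs within a
    // warp, the two sides of the branch run one after the other instead of
    // in parallel, so divergent code loses much of the GPU's advantage.
    __global__ void divergent(const int* vals, int* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (vals[i] & 1)                 // data-dependent branch
            out[i] = work_odd(vals[i]);
        else
            out[i] = work_even(vals[i]);
    }

(With bodies this cheap the compiler will just predicate both sides away;
real string-handling branches are not so lucky.)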

Save the GPUs for training neural nets and physics simulations. When the
fundamental hardware capabilities of GPUs make them worth investing in for
database workloads, I'll change my opinion. Until then, any budget for
expensive DB hardware is much better spent on NVMe storage.

~~~
zlynx
You can write code to compare strings without branches. Imagine comparing one
string to every string in a database column. Pad to a fixed size first (if you
find an ambiguous match later you can do a full string compare then). Subtract
the strings, aka compare, and store the result. At the end of all the
comparisons you have a memory structure of negative, zero, or positive
values. Then you can do something with the matching values.

There have been some _very sweet_ algorithms for doing B+ tree searches for
Itanium and for AVX, and those work just as well, if not better, on GPUs.

With some thinking, branch operations can be converted into math and applied
to masses of data without checking for branch conditions. This wastes some
work but is still faster than branching.
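
A minimal CUDA sketch of what I mean, assuming keys already padded to a fixed
width (all names made up):

    // Compare one probe key against every padded key in a column, keeping
    // only the first nonzero byte difference -- no data-dependent branch.
    __global__ void cmp_column(const unsigned char* col,   // n rows * width bytes
                               const unsigned char* probe, // width bytes
                               int* out, int n, int width) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        const unsigned char* row = col + (size_t)i * width;
        int result = 0;
        for (int b = 0; b < width; ++b) {
            int diff = (int)row[b] - (int)probe[b];
            result += (result == 0) * diff;  // arithmetic select: first diff wins
        }
        out[i] = result;  // <0, 0, or >0, like memcmp
    }

Every thread does identical work regardless of the data, so the warp never
diverges.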

~~~
hendzen
That sounds like an interesting technique - but unfortunately it does not
apply to a majority of real-world data.

In particular I mean sorting non-English text, which any serious database
needs to be able to do. Subtracting characters doesn't work when you're
dealing with language-specific Unicode collation [0].

[0] -
[http://www.unicode.org/reports/tr10/](http://www.unicode.org/reports/tr10/)

~~~
zlynx
Yeah, well, you just can't do it that way and be fast at the same time. I mean
seriously fast.

What you can do is decide how your comparison should be collated and
preprocess your strings into a sort key form that is strictly big-endian
binary (first byte has highest weight). Or little-endian I suppose, whichever
works best for your hardware.

Your sort key doesn't have to be the actual string.
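
This is essentially what ICU's collation sort keys give you; a host-side
sketch assuming ICU is available (error handling elided):

    #include <cstdint>
    #include <unicode/ucol.h>

    // Precompute a binary sort key once per string; afterwards a plain
    // byte-wise memcmp orders strings according to the locale's collation,
    // which is exactly the branch-friendly form you want on a GPU.
    int32_t make_sort_key(UCollator* coll, const UChar* s, int32_t len,
                          uint8_t* key, int32_t key_cap) {
        return ucol_getSortKey(coll, s, len, key, key_cap);
    }

    // Usage:
    //   UErrorCode status = U_ZERO_ERROR;
    //   UCollator* coll = ucol_open("sv", &status);  // e.g. Swedish rules

You pay the collation cost once at load time instead of on every comparison.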

------
optimiz3
How do they guarantee the correctness of results? A major problem with GPUs
is that you see single-bit errors with surprisingly high frequency.

For graphics this usually doesn't matter as a minor color or vertex deviation
isn't noticeable, but for compute it can be devastating.

We do cryptocurrency mining on an industrial scale and constantly see single
bit errors from hardware that is brand-new without modifications.

~~~
n00b101
_We do cryptocurrency mining on an industrial scale and constantly see single
bit errors from hardware that is brand-new without modifications._

That's very surprising and interesting. How do you detect these single bit
errors?

~~~
optimiz3
Detection of false positives: Run candidate solutions through the CPU

Detection of false negatives: Compare solution distribution and frequency to
expected models; switch to debug kernels if outside tolerance.

However, this works because the mining problem space is stateless and follows
strict mathematically predictable models.
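
The false-positive recheck is cheap because candidates are rare; a generic
sketch, with cpu_hash as a stand-in rather than our actual implementation:

    #include <cstdint>
    #include <vector>

    // Stand-in for the coin's reference hash; the real one runs on the CPU.
    uint64_t cpu_hash(uint64_t nonce) { return nonce * 0x9e3779b97f4a7c15ULL; }

    // Keep only GPU-reported candidates that the CPU confirms, discarding
    // hits caused by single-bit errors on the card.
    std::vector<uint64_t> verify(const std::vector<uint64_t>& candidates,
                                 uint64_t target) {
        std::vector<uint64_t> confirmed;
        for (uint64_t n : candidates)
            if (cpu_hash(n) <= target)
                confirmed.push_back(n);
        return confirmed;
    }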

A DB is stateful and the answers generally can't be verified without
consulting a secondary copy, which is why I'm super curious how they would
engineer correctness and reliability in a cost-effective way using GPUs.

~~~
n00b101
That's very interesting. Have you collected statistical data on these bit
errors? Is it always a single bit error?

I'm assuming you are using GeForce cards and not Tesla cards, which have an
ECC memory protection mode?

I've tried to collect some statistics on GPU memory error rates but have
found such errors to normally be extremely rare. The only time I've
reproducibly seen them is due to faulty hardware, where the errors become
highly reproducible and the GPU needs replacement. The other theoretical
cause of bit flips is supposed to be random errors due to cosmic radiation,
but I've never been able to observe that using memory testing software
(though I did only run the experiments on AWS).

Could it be that you have faulty or low grade GPUs? I assume these are all
low-cost OEM parts, given your application? Or maybe there's something odd
about your data center environment?

Regarding the GPU database application, I think the answer is to just use a
Tesla-grade GPU with ECC memory enabled.

~~~
optimiz3
Generally we prefer AMD cards as most (profitable) mining functions are
memory-bandwidth dominated. Usually it's a shader unit that gets unstable in
the 70-80C range (note that most silicon is rated for higher ranges).

AMD's hardware specifications are more open too, which lets you build your
own shader compilers and get direct access to the iron.

------
jhj
The numbers for GPU databases look “good” because you can get pretty high
cross-sectional bandwidth to a reasonably large memory from 8 GPUs in one box,
and advertise blazing speed from that. But it’s just a trick.

The only thing that matters for them here is the aggregate, cross-sectional
bandwidth to your data’s working set in memory. For databases, especially for
the approach that many GPU databases take (light on indexes since GPUs aren’t
great at data-dependent memory movement or pointer chasing, just brute-force
scan much of the data), the working set size is something that will only fit
in main memory.

Instead of using 8 GPUs with a peak global, cross-sectional memory b/w of 8 *
320 = 2560 GB/sec and brute-force scans, you can parallelize across ~40 CPU
nodes each with ~60 GB/sec b/w to main memory. The cross-sectional bandwidth
will be about the same, and the cost to split and join the results of the
query is likely small in comparison to actually doing the work, assuming the
intermediate results are reasonably small. You can use a broadcast and
reduction tree; the added latency of the broadcast and reduction tree's depth
likely won't add much, since there isn’t much data to broadcast in a query,
and the data returned by each machine for the reduction is hopefully (!) tiny
in proportion to the actual data scanned.
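
To put toy numbers on the tree overhead (all assumed):

    #include <cmath>
    #include <cstdio>

    int main() {
        const int    nodes  = 40;      // CPU nodes from the example above
        const double scan_s = 15.0;    // assumed per-node shard scan time
        const double hop_s  = 0.0005;  // assumed per-hop network latency
        const int    depth  = (int)std::ceil(std::log2((double)nodes));  // 6
        // broadcast down plus reduce up: 2 * depth hops of tiny messages
        printf("tree adds %.3f s to a %.0f s scan\n",
               2 * depth * hop_s, scan_s);
        return 0;
    }

A few milliseconds of tree latency against seconds of scanning: noise.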

If you want to consider indices on the data, then maybe the upper levels of
those can remain resident in a CPU’s cache, which will make the individual
CPU scans even faster. The GPU caches are tiny and mainly serve to patch up
strided loads and other bad uses of memory.

Whether or not it’s worth it one way or another depends upon how large your
database is, the relative cost of GPUs versus CPU nodes to get the memory
capacity and cross-sectional b/w you need, perf/W, and other issues.

You’re probably nowhere near close to arithmetic throughput bounds on GPUs or
CPUs since these workloads have very low op / byte loaded ratios compared to
typical HPC workloads, so that aspect of GPUs doesn’t matter. If you’re doing
expensive pre- or post-processing on GPUs as well, then that may push the
balance more towards GPUs.

~~~
felipe_aramburu
You bring up some really good points about bandwidth to the data and how many
of our competitors have gotten ridiculously high benchmarks. We do NOT cache
on the GPU, EVER. The reason is that most of our columns are much too large
to fit on one or even 8 GPUs, let alone the rest of the space required for
processing on them.

This is why we actually prefer to have only 1 GPU per server when we are
making our own boxes. We find that our most optimal running environment is
smaller instances with only 1 GPU each. This is due to the fact that two
smaller rigs with a GPU each will benefit from increased CPU RAM throughput
(basically double that of 1 rig), and you have 2x the PCIe bandwidth since
you are splitting the work across two machines. While we don't have the
enviable op / byte loaded ratio that some machine learning toolsets might
enjoy, we are able to greatly improve these throughputs by using compression
and doing things like running multiple arithmetic operations in one kernel
call.
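
To illustrate what fusing buys (a generic sketch, not our actual kernels):
each extra kernel launch is another full pass through GPU global memory, so
combining operators into one kernel cuts the memory traffic.

    // One fused kernel: scale, shift, and square the column in a single
    // pass, instead of three kernels each making a full trip through GPU
    // global memory.
    __global__ void fused_ops(const float* in, float* out, int n,
                              float scale, float offset) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = in[i] * scale + offset;  // ops 1 and 2
            out[i] = v * v;                    // op 3, same read and write
        }
    }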

------
zitterbewegung
It would be interesting to benchmark this against
[https://wiki.postgresql.org/wiki/PGStrom](https://wiki.postgresql.org/wiki/PGStrom)
.

~~~
mcphage
And MapD, too.

~~~
scottlocklin
Got pals at MapD. I'd like to see any of them benchmarked against Kx. Pretty
much all database problems are IO bound, and not many get it as right as Art
did.

------
fasteo
Every time I read about offloading work to the GPU, the good old days come to
mind.

I vividly remember the Intel 8087 [1], a math coprocessor for the Intel 8086
that came out in 1980-1981. All the floating point arithmetic was offloaded
to it.

It ended up disappearing as a separate chip with the Intel 80486 in the late
eighties.

[1]
[https://en.wikipedia.org/wiki/Intel_8087](https://en.wikipedia.org/wiki/Intel_8087)

~~~
gaius
Also
[https://en.m.wikipedia.org/wiki/Original_Chip_Set#Copper](https://en.m.wikipedia.org/wiki/Original_Chip_Set#Copper)

The good old days indeed!

------
kungfooguru
Sounds similar to what Netezza did with FPGAs, but here with GPUs. So they
may hit patents they say they are unaware of...

~~~
biocomputation
Yeah, they need to be unaware of them...

