
MapD Open Sources GPU-Powered Database - anc84
https://www.mapd.com/blog/2017/05/08/mapd-open-sources-gpu-powered-database/
======
ccleve
I would very much like to see a blog post or some other explanation of how a
GPU speeds up a database query.

Most traditional SQL databases are I/O-bound, meaning that the cost of pulling
index pages from disk into memory overwhelms everything else, and adding more
CPU doesn't do much good. Even for in-memory databases the cost of pulling a
page into GPU memory is likely to be high. The calculations you do are pretty
simple -- you mostly traverse arrays of pointers and do intersections and
accumulations. No floating point, just increment and compare integers.

GPUs really shine when you can parallelize operations on blocks of data that
are already in GPU memory. That's why they're great for graphics and games.
I'm not seeing how they help with database queries over large volumes of data
which cannot fit in GPU memory.

I'd really love to read some kind of tutorial on the topic. There's likely
some kind of magic here that I don't see.

~~~
paulmd
They don't help on large volumes of data that cannot fit into GPU memory,
obviously. PCIe is even slower than standard CPU memory; pulling non-trivial
amounts of data across the PCIe bus is catastrophic for any GPU program.

GPUs shine because their VRAM has enormous bandwidth (up to ~512 GB/s for
high-end modern GPUs). That's an order of magnitude more than CPU memory.

So basically you put indexes and perhaps a few key columns in VRAM, but as
little actual data as possible, and just return the row IDs (or ranges) of
interesting data. That lets you cross-reference back into host memory, where
you can apply further filtering criteria based on whatever isn't available in
GPU memory.
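
The core kernel is dead simple -- something like this (a minimal CUDA sketch
of the pattern, not MapD's actual code; the column contents and the toy
predicate are made up):

    // scan_filter.cu -- sketch: scan a VRAM-resident column, ship back row IDs.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void filter_rows(const int* col, int n, int threshold,
                                int* out_ids, int* out_count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && col[i] > threshold) {
            int slot = atomicAdd(out_count, 1);  // compact matches (order not preserved)
            out_ids[slot] = i;                   // ship back the row ID, not the row
        }
    }

    int main() {
        const int n = 1 << 20;                   // one column chunk resident in VRAM
        int* h_col = new int[n];
        for (int i = 0; i < n; ++i) h_col[i] = i % 1000;

        int *d_col, *d_ids, *d_count;
        cudaMalloc(&d_col, n * sizeof(int));
        cudaMalloc(&d_ids, n * sizeof(int));
        cudaMalloc(&d_count, sizeof(int));
        cudaMemcpy(d_col, h_col, n * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemset(d_count, 0, sizeof(int));

        filter_rows<<<(n + 255) / 256, 256>>>(d_col, n, 990, d_ids, d_count);

        int count;
        cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d row IDs to cross-reference against host memory\n", count);
        // A real system would now copy just those IDs back and finish on the CPU.
        return 0;
    }

All the bandwidth gets spent streaming the column out of VRAM; the only
traffic back across PCIe is the (hopefully small) list of matching row IDs.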

GPU memory isn't super-abundant, but high-end cards have up to 24 GB, and if
you stack a bunch of them in a rack then you are talking about a reasonable
amount of memory for most use-cases. Four cards in a box is easy, and you can
put four cards in a 1U chassis if you use GP100 mezzanine cards. So now we're
talking 96 GB of VRAM per box or per 1U, which works out to quite a bit of
index data.

The other use-case would be something that's really computationally intensive
to evaluate, but for some reason cannot be pre-computed and cached. So, for
example: find me all rows where "sha256(my_table.my_val & CAST(now() AS
INTEGER)) > 256". Which, as you can imagine, mostly trends towards crypto
kinds of stuff.
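
In CUDA that shape looks something like this (toy sketch; mix() below is a
made-up, cheap stand-in for sha256 just to burn cycles -- a real
implementation would call into a GPU hash library):

    // expensive_predicate.cu -- toy sketch of a compute-heavy filter.
    #include <cstdio>
    #include <ctime>
    #include <cuda_runtime.h>

    __device__ unsigned long long mix(unsigned long long x) {
        for (int r = 0; r < 64; ++r) {           // stand-in for a real hash
            x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 29;
        }
        return x;
    }

    __global__ void hash_filter(const long long* vals, int n,
                                unsigned long long seed,
                                int* out_ids, int* out_count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && mix((unsigned long long)vals[i] ^ seed) < (1ULL << 58))
            out_ids[atomicAdd(out_count, 1)] = i;  // proof-of-work style target test
    }

    int main() {
        const int n = 1 << 20;
        long long* vals; int *ids, *count;
        cudaMallocManaged(&vals, n * sizeof(long long));
        cudaMallocManaged(&ids, n * sizeof(int));
        cudaMallocManaged(&count, sizeof(int));
        for (int i = 0; i < n; ++i) vals[i] = i;
        *count = 0;
        hash_filter<<<(n + 255) / 256, 256>>>(vals, n,
            (unsigned long long)time(nullptr), ids, count);
        cudaDeviceSynchronize();
        printf("%d rows passed the expensive predicate\n", *count);
        return 0;
    }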

~~~
dogma1138
Pascal supports up to 49-bit memory addressing, which covers 512 TB worth of
addresses. Address translation between GPU and system memory addresses is done
in hardware, which means prefetches (via usage hints / branch prediction) and
page faults are handled by the GPU.
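
In CUDA terms, those usage hints are cudaMemAdvise plus an explicit
cudaMemPrefetchAsync on a managed allocation (minimal sketch, assuming a
single GPU as device 0):

    // oversubscribe.cu -- sketch of Pascal unified memory: the allocation
    // may exceed VRAM; pages migrate on fault unless you prefetch them.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void touch(float* data, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const size_t n = 1ULL << 28;              // ~1 GB of floats
        float* data;
        cudaMallocManaged(&data, n * sizeof(float));
        data[0] = 1.0f;                           // CPU can touch it directly

        // Usage hint: this range should mostly live on (assumed) device 0.
        cudaMemAdvise(data, n * sizeof(float),
                      cudaMemAdviseSetPreferredLocation, 0);
        // Explicit prefetch: one bulk transfer instead of per-page faults.
        cudaMemPrefetchAsync(data, n * sizeof(float), 0);

        unsigned blocks = (unsigned)((n + 255) / 256);
        touch<<<blocks, 256>>>(data, n);
        cudaDeviceSynchronize();
        printf("data[0] = %f\n", data[0]);        // pages migrate back on demand
        cudaFree(data);
        return 0;
    }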

In this case you are only bound by PCIe bandwidth (or NVLink if you are using
a supporting GPU), and performance-wise it's pretty near native speeds.

You can also access network and other types of storage from within the GPU,
since MMIO is universal these days.

AMD will add similar functionality starting with Vega.

This means that with both vendors you are no longer bound by the GPU onboard
memory.

And if PCIe x16 isn't enough, NVLink comes with up to 200 GT/s these days and
has an InfiniBand endpoint.

~~~
paulmd
The problem is that "PCIe bandwidth" is really bad. PCIe 3.0 x16 is only 16
GB/s, which again is less than half of what you would get from any old DDR4
memory. You might as well do the work on the CPU at that point.

NVLink is fine as far as it goes... the problem is that it's a two-party
transaction: NVLink still has to be fed _by something_, and that's either
another GPU, the CPU, or something like an SSD via DMA. With a database-type
program, the CPU and GPU feeder cases are pointless, because you could just be
doing the search there and returning results directly instead of doing device-
to-device DMA. And NVMe SSDs are usually limited to PCIe 2.0 x4, which is 2
GB/s -- absolute maximum 16 GB/s, same as any other PCIe-attached device.

Remember, GPUs aren't always used on compute-bound tasks. There is a very
valid use-case for I/O bound tasks where you are just using it as a super-
cache. GPU database programs often (but not always) fall into that category.
You are just searching indexes/target columns like crazy and throwing the
resulting row IDs back over the wall to the CPU.

~~~
dogma1138
NVLink is a GPU-to-GPU or GPU-to-CPU link. If you are using it as a GPU-to-CPU
link, you are talking to the CPU at native NVLink speeds, either over a native
NVLink bus if you are using a POWER CPU that supports it, or via an
InfiniBand/Omni-Path interconnect on Intel/AMD CPUs.

16 GB/s is a bit tight, but that's where prefetching based on pre-defined
usage hints comes in, along with the branch predictor in the GPU and the CUDA
compiler/runtime, which feed the GPU with as little stalling as possible.

Of course you can make a GPU I/O-bound; that's actually pretty easy these days
even in normal (gaming) workloads (e.g. the draw-call limit).

Let's say you are building a 4-GPU DB. If you are using NVLink, each GPU can
see the entire memory space of the other GPUs connected to the same NVLink
fabric (there is also RDMA, which can be done over the network, but that is a
more complicated scenario). In that case you have 96 GB of high-speed VRAM,
plus whatever amount of memory you can put on your CPU, which can be up to 1.5
TB or so per socket these days. You also have 64 GB/s of bandwidth (four x16
links' worth) to feed that 96 GB of memory with, which means it takes about a
second and a half to refill it completely. (And I'm still not sure that the
lossless compression schemes GPUs use these days won't work for your data,
which would give you much higher effective bandwidth, since the
compression/decompression is free on the GPU side.)

With these figures I can't see a reason why you couldn't optimize your memory
residency to get the best of both worlds: fast key-value/hash or index lookups
for I/O-bound tasks, and, for processing-bound tasks, programmatically
prefetching the workload or (if the branch prediction is good enough) just
letting it roll.

So yes, it's not "perfect", but nothing ever will be. The question is whether
it is much faster, or even orders of magnitude faster, for certain tasks /
workloads / implementations than a CPU-only DB, and clearly it is. How well it
would compete against other contenders like Intel's Xeon Phis, programmable
FPGAs and other emerging technologies that aim to solve this problem from
another direction, I don't know.

~~~
paulmd
Even if you have a super-fast interconnect, you can only fill as fast as
something else can send it, right?

I mean, if it's in CPU memory then you are limited to whatever the CPU memory
can deliver. Which is 40 GB/s per socket from quad-channel DDR4, more or less.
You can probably double that with compression, but it's still vastly less than
you'd get from VRAM.

And you get the same speedups from _storing the data compressed in VRAM_,
assuming you have some fast method for indexing into the stream, or you're
searching the whole stream. But yes, assuming your data can be dehydrated and
then stored in CPU memory, you can squeeze a little more bandwidth out of it
this way.
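
Dictionary encoding is the obvious instance of this: strings become small
integer codes at load time, the codes sit in VRAM, and the predicate gets
encoded once on the host so the GPU only ever compares integers (a toy sketch,
not MapD's actual encoding):

    // dict_scan.cu -- toy sketch: filter a dictionary-encoded column in VRAM.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dict_filter(const int* codes, int n, int wanted,
                                int* out_ids, int* out_count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && codes[i] == wanted)               // integer compare, no strings
            out_ids[atomicAdd(out_count, 1)] = i;
    }

    int main() {
        const char* dict[] = { "NY", "SF", "LA" };     // built at load time
        int h_codes[] = { 0, 1, 2, 1, 1, 0, 2, 1, 0 }; // the "compressed" column
        const int n = 9;
        int wanted = 1;                                // WHERE city = 'SF', pre-encoded

        int *d_codes, *d_ids, *d_count;
        cudaMalloc(&d_codes, sizeof(h_codes));
        cudaMalloc(&d_ids, sizeof(h_codes));
        cudaMalloc(&d_count, sizeof(int));
        cudaMemcpy(d_codes, h_codes, sizeof(h_codes), cudaMemcpyHostToDevice);
        cudaMemset(d_count, 0, sizeof(int));
        dict_filter<<<1, 32>>>(d_codes, n, wanted, d_ids, d_count);

        int count;
        cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%d rows where city = %s\n", count, dict[wanted]);
        return 0;
    }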

If it's GPU memory (in another GPU across NVLink/GPUDirect/etc), then you
should be doing the searching on that GPU instead.

The stipulation here has always been "tasks that don't fit into GPU memory
aren't going to work" so if you are assuming that the data lives on another
GPU... then you are violating the pre-condition of all of this discussion,
because it fits into GPU memory.

Yes, if your data fits into GPU memory it's very fast, because VRAM bandwidth
is enormous. The catch is always getting it to fit into GPU memory. That's why
you need a "tiered" system - indexes or important columns live on the GPU,
then you return them to the CPU where you do additional processing with them.
It's like L1/L2/L3/memory hierarchy on a CPU - sure, your program that fits
into cache is super fast, but many real-world tasks don't fit into cache.

If your entire dataset is small enough that it fits in VRAM on one box's worth
of GPUs (4-8 per box), or within a few racks or whatever... that's great, go
hog wild. But it's not very cost-effective versus a tiered approach; it's like
asking for a CPU with 16 GB of cache (or enough CPUs to add up to 16 GB of
cache). And it's outright impossible for very large datasets (96-192 GB is not
a very large dataset in this context, but it's a reasonable amount of _index
space_ for a 1.5 TB or 3 TB dataset on the CPU socket).

Once you start having to transfer stuff into and out of VRAM, GPU performance
typically starts to rapidly degrade. It's "only a second or two" to you... but
that's a terabyte's worth of VRAM bandwidth that sat idle for that time.
"Rolling" approaches like this do not work very well on GPUs. You want to
avoid transferring stuff on or off as much as possible because you just can't
do it fast enough.

Computational intensity is one way to avoid doing that, both for VRAM and for
host memory. If you can transfer a byte (per core) every P cycles, and you
perform at least P cycles' worth of computation on it... you're compute-bound.
And at that point you can probably make a "rolling" algorithm work. But the
problem is doing that (without just being obviously wasteful). In practice, P
is something like one byte every 72 cycles (working from memory here). It's
really hard to find that much work to do on a byte in many cases. So despite
their massive compute power... GPUs are I/O-bound on most tasks. Crypto tasks
(hashing) are a notable exception.
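
To put rough numbers on it (back-of-envelope only; the core count and clock
below are assumed Pascal-ish figures, and the real P depends heavily on which
bus you're crossing):

    // cycles_per_byte.cu -- "P" from above: cycles of compute available per
    // byte delivered, per core, given aggregate bandwidth.
    #include <cstdio>

    double cycles_per_byte(double cores, double clock_hz, double bytes_per_s) {
        return cores * clock_hz / bytes_per_s;
    }

    int main() {
        // Assumed figures (Titan X Pascal-ish): 3584 cores at ~1.4 GHz.
        printf("vs VRAM (512 GB/s):    %4.0f cycles/byte\n",
               cycles_per_byte(3584, 1.4e9, 512e9));
        printf("vs PCIe 3.0 (16 GB/s): %4.0f cycles/byte\n",
               cycles_per_byte(3584, 1.4e9, 16e9));
        return 0;
    }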

Note that I/O bound here can happen _in different areas_ too. You can be I/O
bound over the GPU bus, or on the VRAM. It depends on where the piece of data
you need lives.

Also, in gaming it's typically _the CPU_ which is bound in draw-calls. The CPU
can't assemble the command lists fast enough to saturate the GPU (at least not
on a single thread). I'm not sure I'd say it's _necessarily_ I/O bound in this
case, either, but it's possible.

Normally GPUs will only use a fraction of their VRAM bandwidth while gaming -
however, like anything else, throwing superfluous resources at it _will_ still
produce a speedup even if it's not really "the KEY bottleneck". Having memory
calls return in fewer cycles will let your GPU get its shaders back to work
quicker, even though it means the memory may be idle for a greater percent of
time (i.e. utilization continues to fall).

Assuming you were going to do a DB where you store literally everything on the
GPU, I'm not sure whether it would be better to go columnar or row-wise. In a
columnar approach you would have one (or more) GPUs per column, and they would
either sequentially broadcast their resulting row IDs so that all GPUs perform
the set intersection on their own datasets, or dump their sets of output row
IDs to a single "master" GPU, which would find the intersection. In a row-wise
approach, each GPU would find "candidate" row IDs and then dump them to the
CPU, and the CPU would handle the broadcasting and intersecting (since
hopefully it's not a large set-intersection).
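
The intersection step itself is cheap to sketch with Thrust (assuming each GPU
has already produced a sorted list of row IDs):

    // intersect_rowids.cu -- sketch of the "master GPU" set-intersection step
    // with Thrust. Assumes each input list of row IDs is already sorted.
    #include <cstdio>
    #include <thrust/device_vector.h>
    #include <thrust/host_vector.h>
    #include <thrust/set_operations.h>

    int main() {
        int a[] = { 2, 5, 7, 11, 19 };   // row IDs matching the predicate on column A
        int b[] = { 5, 6, 7, 19, 42 };   // row IDs matching the predicate on column B
        thrust::device_vector<int> da(a, a + 5), db(b, b + 5);
        thrust::device_vector<int> out(5);

        auto end = thrust::set_intersection(da.begin(), da.end(),
                                            db.begin(), db.end(), out.begin());
        out.resize(end - out.begin());

        for (int id : thrust::host_vector<int>(out))
            printf("row %d matches both columns\n", id);  // 5, 7, 19
        return 0;
    }

Chaining more columns is just repeated intersections, and the lists shrink as
you go, which is the whole appeal.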

Columnar might be faster for supported queries, but row-wise would be simpler
to implement and more flexible (since the critical data processing would more
or less be taking place on the CPU).

I would have to play with it to be sure, and I don't have access to a cluster
of GPUs anymore (and never had access to one with NVLink or GPUDirect).

~~~
dogma1138
But the address translation effectively gives you that "tiered system". Let's
say you have a 20 TB DB: 96 GB fits in VRAM, 2 TB fits in your RAM, and 17 TB
and change are on your disks.

Today the GPU can see all of this as a single contiguous memory space: the
hardware does the address translation, handles page faults when needed, and
tries to optimize the execution of each task by prefetching the needed data
into VRAM by itself, through branch prediction or through usage hints you
program yourself.

Here is a short paper/write-up on unified memory in Pascal; look at the
difference between relying on page faults alone and optimizing/profiling your
memory usage with the prefetch hints. Growing your data set to 4x the maximum
GPU memory costs you about a 3.8-4x performance hit when you rely on page
faults alone, but less than a 50% loss when you use the hints, even while
going over 4x your maximum available VRAM.

The same thing can be done on cards with 24 GB, and the magic is that the
performance drop actually levels off at about 2-3x your maximum available
VRAM, which is why the difference between 2x and 4x the amount of memory isn't
that big in terms of performance hit versus over-allocation ratio; 8x your
amount of memory is not that far behind the penalty you already pay for using
4x more.

And yes, there is no scenario in which you do not lose performance, but the
unified memory solution in NVIDIA GPUs is a very good way to reduce the
penalty, and if you really need to go balls-to-the-wall you'll actually
benefit from the scaling.

------
drej
I ran cloc to get a better idea of the tech in this. Interesting.

Edit: I noticed a lot of Go, but only one live script in the repo, so I
excluded the 'ThirdParty' folder, where some vendored libraries reside. Edit
2: that 90K-SLOC single-file C goodness is a full SQLite release, vendored
outside the ThirdParty folder; I've taken it out as well.

    
    
      --------------------------------------------------------------------------------
      Language                      files          blank        comment           code
      --------------------------------------------------------------------------------
      C++                             127           4385           3960          54609
      C/C++ Header                    158           2960           4273          14301
      Java                             42           1599           1498           9539
      CUDA                              8            301             47           3125
      CMake                            36            377            967           2616
      yacc                              1            133            156            742
      Maven                             6              0            126            507
      Go                                1             72             12            491
      Bourne Shell                      5             72             43            369
      Bourne Again Shell                3             38              0            208
      lex                               1             28             12            204
      XML                               1             12             12            182
      Python                            5             65             25            168
      C#                                1             12             11            112
      JSON                              1              0              0             73
      SQL                               3              0              0             25
      --------------------------------------------------------------------------------
      SUM:                            399          10054          11142          87271
      --------------------------------------------------------------------------------

~~~
solidr53
Wow, that's interesting. Most of the languages other than C, C++, CUDA, Go and
Java must be build-related things, right?

~~~
mattnewton
I assume they are bindings? But I haven't found an API doc that confirms that.

Edit: never mind, many of the 1-2 file languages are excluded from the latest
list, so they were in the "third party" folder.

------
wesm
I see they are supporting Apache Arrow
([http://arrow.apache.org/](http://arrow.apache.org/)) in their result set
converter, which will be nice for interoperability with other data systems that
use Arrow: [https://github.com/mapd/mapd-
core/blob/21fc39fab9e1dc2c1682b...](https://github.com/mapd/mapd-
core/blob/21fc39fab9e1dc2c1682bcf3dcc2b49d5503aea6/QueryEngine/ResultSetConversion.cpp)

(I'm on the Arrow PMC)

~~~
tmostak
Yes, it's definitely alpha functionality for now, but we're very excited about
this piece and about potentially supporting more of the Arrow spec. In case
you missed it, check out the first project we are working on with Continuum
and H2O, the "GPU Data Frame":
[https://github.com/gpuopenanalytics/pygdf](https://github.com/gpuopenanalytics/pygdf).

------
Bedon292
Available on the AWS Marketplace, already set up:
[https://aws.amazon.com/marketplace/pp/B071H71L2Y](https://aws.amazon.com/marketplace/pp/B071H71L2Y)
You can test it for as little as $0.90/hour on a p2.xlarge. I know I
absolutely cannot wait to play with it.

------
reacharavindh
This blog has a few performance numbers based on a NYC taxi rides dataset, and
some setup notes. Might be relevant for HNers here.
[http://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-
pas...](http://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-pascal-titan-
x-mapd.html)

I have nothing to do with the blog. Just sharing.

~~~
mmcclellan
This is where I first got interested in MapD. The author has a new post
detailing how to compile MapD: [http://tech.marksblogg.com/compiling-mapd-
ubuntu-16.html](http://tech.marksblogg.com/compiling-mapd-ubuntu-16.html)

------
rsp1984
This is quite cool. I'm curious, though, what the monetization strategy is
now. After all, they're a for-profit company that has taken a lot of VC money.

Sure, they could sell integration and maintenance services on top but services
are typically much harder to scale than software -- not quite what gets
investors excited.

~~~
shusson
My understanding is that the distributed[1] capability of MapD is not open
sourced. And I'm guessing a lot of the companies (especially the ones with
money to burn) that would use MapD require at least some horizontal scaling.

[1][http://docs.mapd.com/latest/mapd-core-
guide/distributed/](http://docs.mapd.com/latest/mapd-core-guide/distributed/)

------
wiradikusuma
For the layman: is this something that I can use to replace MySQL/Postgres
(with ACID), assuming I only use ANSI SQL and no stored procs?

~~~
remus
Probably not. MySQL and postgres are both very general purpose relational
databases that have a broad range of features aimed at a wide range of use
cases.

MapD is designed for a narrower set of use cases, specifically analytic
querying and visualisation.

------
Meai
The demo
[https://www.mapd.com/demos/tweetmap/](https://www.mapd.com/demos/tweetmap/)
is pretty laggy. I think you should probably make it more performant so people
don't think it's the DB that is causing the problem.

~~~
tigershark
And why is Finland "scrambled" with that big rectangle of noisy tweets,
uniformly distributed even over the sea? O_o

~~~
tmostak
It's a bot.

------
latenightcoding
After reading a book on RDBMS design I had a similar idea, but after thinking
for a couple of minutes I just couldn't come up with a use case for a GPU-
powered database; it would be significantly slower in most scenarios.

~~~
_wmd
Just curious, which book did you read?

~~~
latenightcoding
I read a couple; this is one of them:
[http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf](http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf)

------
polskibus
This is great stuff! I wonder how it makes IBM, Oracle and MS feel right now.

I can even see that INSERTs are allowed; I wonder how it works with JOINs and
whether it is ever going to support transactions.

------
red2awn
How does this compare to other GPU databases like BlazingDB?

------
ape4
Is processing speed the bottleneck for databases?

~~~
remus
It depends what you're doing. I haven't looked in a while, but last time I did
the idea behind MapD was to produce interactive visualisations over lots of
data. That can certainly get pretty computationally expensive.

