
Show HN: my database engine for GPU - antonmks
https://sourceforge.net/projects/alenka/files/
======
breckinloggins
Excellent work! I have a very small (marketing-related) nitpick:

Would you consider hosting your project on GitHub instead of SourceForge? I
don't say this to be trendy; GitHub just has a much more pleasing interface,
in my opinion. It's also faster, subjectively at least.

~~~
astrodust
I second this. I'd rather be at the dentist than on a SourceForge page.

I swear it's the "myspace.com" of open source hosting.

------
mrich
Congratulations, well done!

Our team has built an in-memory DB (hybrid, with both column and row stores
available, also using advanced compression). We have investigated GPUs for
scans and found that while they can outperform CPUs, the bottleneck is the
transfer of the data to the card. Since the card's memory is only in the
gigabytes and our databases are typically much larger, this is kind of
limiting: you have to copy data back and forth.

Do you know what the maximum amount of GPU memory for a blade-like system
currently is?

~~~
antonmks
It is 6 GB for a Tesla GPU. PCI Express transfer speed is 5 GB/s; with a
compression ratio averaging 4:1, the effective speed becomes 20 GB/s, so
transfer speed is not a bottleneck anymore.
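
To make the arithmetic concrete, here is a back-of-the-envelope sketch (the
5 GB/s and 4:1 figures are the ones above; the code is just an illustration):

    // Effective PCIe bandwidth once compression is factored in.
    #include <cstdio>

    int main() {
        const double pcie_gb_s = 5.0;  // raw PCI Express transfer speed
        const double ratio     = 4.0;  // average compression ratio (4:1)
        // Each GB moved over the bus expands into `ratio` GB of usable
        // column data, so the engine effectively sees:
        printf("effective bandwidth: %.0f GB/s\n", pcie_gb_s * ratio); // 20
        return 0;
    }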

~~~
mrich
I see. So for a large dataset you would still hit a bottleneck, compared to a
main-memory store which also uses compression. But nonetheless it's a good
idea to do certain operations on hot data on the GPU.

Which operations benefit the most, in your experience? I would guess scans
with simple predicates and expressions. What about aggregations?
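
To be concrete about what I mean by a scan with a simple predicate, here is a
minimal CUDA sketch (my own illustration; the column name and layout are made
up, not taken from Alenka):

    // One thread per row: mark rows of a price column above a threshold.
    __global__ void scan_gt(const long long *price,  // 8-byte column values
                            int *match,              // 1 if the row passes
                            long long threshold,
                            int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            match[i] = price[i] > threshold;
    }

    // Launched with one thread per row, e.g.:
    //   scan_gt<<<(n + 255) / 256, 256>>>(d_price, d_match, 100LL, n);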

~~~
binarymax
I had investigated aggregates on the GPU for building Dwarf petacubes
(<http://www.cs.umd.edu/~nick/projects/Dwarf.pdf>), and yes, it looks viable.
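
For flavor, a minimal GPU aggregate (SUM ... GROUP BY) using atomics; my own
sketch, and building a Dwarf cube is of course far more involved than this:

    // Minimal SUM ... GROUP BY on the GPU, one thread per row.
    __global__ void group_sum(const int *group,              // group id per row
                              const unsigned long long *val, // value column
                              unsigned long long *sums,      // one slot per group
                              int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&sums[group[i]], val[i]); // contended but correct
    }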

~~~
marshray
I had not heard of that data structure before, thanks!

Sadly, it appears to be patented and thus of no use to me.
<http://www.cs.umd.edu/~nick/projects/Dwarf.html> I think it's really quite
crass and mercantile for UMD to patent basic computer science algorithms and
data structures like that.

------
beagle3
Super cool, and thanks for sharing!

Are you familiar with APL/K/Q/J? These implement column-store databases with a
programming language that is significantly more expressive and usable. K and J
are also significantly faster than any SQL database -- it would be interesting
to compare K's ultra-optimized in-memory processing to the GPU -- I'm not at
all sure the GPU is going to win here in practice.

See <https://github.com/kevinlawler/kona> for an open source K implementation
(the wiki has docs). The same language in a newer version (with database
integration), using English words instead of ASCII characters, is called Q;
see <http://kx.com/q/d/> (kdb1/q1 if you prefer the longer names, kdb/q if you
prefer the terser ones), and <http://code.kx.com/wiki/Main_Page>

~~~
silentbicycle
I'm not sure K's in-memory processing is incompatible with running on a GPU,
either. I've been working on a fork of Kona that has a very different
architecture (e.g. a bytecode VM, generational mark/sweep GC, lexical scope),
and I plan on experimenting with CUDA/OpenCL once the fundamentals are set.

~~~
beagle3
Oh, K's in-memory processing model is surely compatible with modern GPUs - at
least for most primitives.

The thing I'm not sure about is whether the transfer overhead (between GPU and
CPU memory) would still be worth it in the general case. There's a question of
what you're actually comparing to - e.g. you can have a (relatively) naive
implementation, like the first commit of Kona was. You can have Arthur's
cache-optimized-but-still-pure-C-and-CPU implementation (I think the main Kona
branch is comparable these days, but I'm not sure), which is faster.

And then you can compare it to an SSE2/SSE4 version, still on the CPU; such a
version does not exist afaik, but it is easier to code than you would expect
if you use ispc (<https://github.com/ispc/ispc/tree/master/examples>) or Cilk
Plus -- simple things with no data dependencies, as many K primitives are,
often get a 4x speedup with ispc.
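
To illustrate what "no data dependencies" buys you: a K-style primitive like
element-wise add is just an independent loop, which is exactly what ispc maps
onto SSE lanes (and what a GPU maps onto threads). A trivial sketch:

    // A dependency-free K-style primitive (element-wise add). Iterations
    // are independent, so a vectorizer can run them in SIMD lanes and a
    // GPU can run them one per thread.
    void vadd(const double *x, const double *y, double *out, int n) {
        for (int i = 0; i < n; i++)  // no loop-carried dependence
            out[i] = x[i] + y[i];
    }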

Oh, and by the way, thanks for your work on kona and this message -- I've only
been following kevinlawler's branch, but I'll start following you too.

~~~
silentbicycle
The transfer overhead would only need to happen once if there's a sequence of
GPU-able primitives, though. The dataflow analysis for that isn't too hard.
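
Roughly what I have in mind: pay the PCIe copy once, chain the primitives on
device-resident data, and copy only the result back. A hypothetical CUDA
sketch (not Kona code; the kernels are made up):

    #include <cuda_runtime.h>

    __global__ void scale(double *x, double k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= k;
    }

    __global__ void offset(double *x, double c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += c;
    }

    void pipeline(const double *h_x, double *h_y, int n) {
        double *d_x;
        cudaMalloc(&d_x, n * sizeof(double));
        // One transfer in...
        cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);
        int blocks = (n + 255) / 256;
        scale <<<blocks, 256>>>(d_x, 2.0, n); // first primitive
        offset<<<blocks, 256>>>(d_x, 1.0, n); // second, no copy in between
        // ...one transfer out.
        cudaMemcpy(h_y, d_x, n * sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
    }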

I'll check ispc out too, thanks. My fork* is using some SSE, via GCC-specific
extensions (<http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html>). It
could be special-cased into Kona as-is, but I've found a better way to
implement scalars than the elaborate preprocessor hack I did there. (Besides,
tuning the reference counting would probably have a bigger impact, something
like Deutsch-Bobrow at least.)

* It's not on github, yet; I don't usually post things there until they're reasonably usable end-to-end. My kona fork is mostly for pull requests; I'm talking about a completely new implementation.

Would you like to continue this discussion off HN? My contact info is in my
profile.

------
corysama
Very, very cool. I'm always happy to see more ways to utilize the tremendous
performance/price of GPUs.

So, what are the pros and cons of this approach? For a given hardware and
energy budget, what kinds of tasks would be better suited to this kind of GPU
implementation? Conversely, what kinds of tasks would still be faster to run
on a Core i7?

~~~
liquidcool
Agreed, seems like it has a lot of potential.

I remember some time back, when GPUs were first being looked at for HPC uses,
I suggested a client look into them, but they were quickly dismissed due to
their floating-point limitations (which may have changed since).

But DBs don't have huge FP needs. Another place GPUs are making a big
difference is video editing, via CUDA or OpenCL.

~~~
alexchamberlain
I think their floating point performance is improving.

~~~
Impossible
GPUs have always had amazing floating-point performance for single-precision
floats; that's basically all graphics is. The problem is that double-precision
support was traditionally non-existent and, until recently, slow; it is still
much slower than single-precision performance.

~~~
maximilian
Double-precision performance is now just half the single-precision speed,
which is the same ratio as on the CPU. Many applications are bandwidth-bound
anyway, so running in double vs. single precision doesn't end up making a 2x
difference.

------
joshu
now this is the kind of thing i come to HN for... but why oh why sourceforge?

~~~
6ren
GitHub would be better for browsing. But the source is only 117k - I'm
surprised how much of a barrier a few clicks and a few seconds were for me.
(Partly, I expected it to be bigger.) The 21 files total 18,832 lines; there's
bison, lex, and a bunch of CUDA files. Some of it is structured as doing
similar things for similar cases, so it's not that long in effect. It seems to
include a reasonably complete implementation of SQL in the bison grammar (and
there are two example files of SQL query sequences). I'm impressed.

There are some line comments, but no overviews. :(

------
foobarbazetc
This is awesome. :)

Someone did something similar for PostgreSQL here:

<http://wiki.postgresql.org/wiki/PGStrom>

------
chubs
That looks absolutely brilliant. So is the database stored entirely in GPU
memory? Or in main memory? Or does your engine handle storing to disk as well?

Pretty novel use of a GPU though, and if it's quick, it could be really
useful.

~~~
antonmks
Thank you! The data are stored on disk, and only the columns necessary for
transformations are brought to the GPU. If the data does not fit into the
limited GPU memory, it is processed piece by piece. I tested it on
terabyte-sized files, so the data does not have to fit into GPU memory.

------
grundprinzip
Looks like a very interesting approach to me given the recent developments.

Still, I have a few questions regarding your numbers. Can you specify how much
data you actually stream from disk? Doing some back-of-the-envelope
calculations (8 columns out of 1.8B rows, assuming 4 bytes each), you would
need to stream ~300 MB/s from disk, which seems very unlikely for your setup
(unless you have a super-fast SATA drive or an SSD).

Now, with only 4 GB of RAM, you are constantly filling your RAM and
transporting data to the GPU. Plus you need to compress the data somewhere (on
the GPU?).

Do I understand manual.txt correctly that you can only achieve this
performance when the data has been written to disk compressed beforehand
(while sorting it)?

I may be wrong, but the group by and join looked like versions that require
sorted data.

Can you please give a little more detail on that?

~~~
antonmks
Here are the details: integers and decimals are both stored as 8 bytes, so one
record takes 42 bytes. After compression it takes about 8 bytes (I use FOR -
frame-of-reference - compression), and it works really well with decimal
numbers (check the range of the decimal numbers in the lineitem.tbl file -
usually you can encode them in just a few bits). So I can stream about 10
million records per second from my disk - at ~8 compressed bytes per record
that is only about 80 MB/s - which makes it exactly 180 seconds for 1.8B rows
(CUDA calculations are done in a background process). Records are decompressed
on the GPU and processed there. Actually, the main bottleneck is not the disk
access but the latency of CUDA calls - each call takes about 15 ms, and when
you process the entire file in chunks of, say, 6 million records, it really
adds up.
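
To illustrate the FOR idea: each block of values is stored as one full-width
reference value (e.g. the block minimum) plus small per-value offsets, so
decompression on the GPU is a trivial kernel. A simplified sketch - the actual
block sizes and bit widths vary:

    // Frame-of-reference (FOR) decompression on the GPU, one thread per
    // value: out[i] = block reference + small stored offset.
    __global__ void for_decompress(const long long *reference,    // per block
                                   const unsigned short *offsets, // per value
                                   long long *out,
                                   int block_size, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = reference[i / block_size] + offsets[i];
    }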

------
ericflo
Interesting. I wonder how a GPU could best be put to use in a database engine.
Joins, I suppose? It would be cool to read more about this implementation;
however, the readme is a bit sparse.

------
pavelkaroukin
antonmks, did you consider using OpenCL instead? If so, why did CUDA win out?

~~~
mappu
Whether it's still relevant or not, CUDA has a lot of mindshare in the
academic community. Partially this is due to marketing and having their
product out first, but OpenCL also has to go through some more abstractions.
There was a quite popular forum post[1] a while ago where someone claimed a 5x
performance improvement from switching from OpenCL to CUDA, and while I
haven't tested it myself, it's the sort of statistic that sticks in your mind.

______________________

1. <http://forums.nvidia.com/index.php?showtopic=109684>

------
alecco
It's a column store; compare it with MonetDB (open source) and similar
systems. This design is suitable for read-mostly databases. It's a trade-off:
faster reads and much better compression, but the price is that writes usually
become an order of magnitude slower. It isn't good for transactional work
(OLTP).

<http://en.wikipedia.org/wiki/Column-oriented_DBMS>

Also, the author talks (in other comments) about transfer rate, when the real
issue with GPUs is latency. Most GPUs don't have decent RAM caches, and it's
very hard to hide the latency.

Also:
<http://reddit.com/r/programming/comments/oxq6a/a_database_engine_that_runs_on_a_gpu_outperforms/c3kvts6>

------
toblender
Very nice. I'm glad people are using their GPUs for something more productive
than farming bitcoins.

------
2AM
nvidia gpus only, right? does it work with ati as well? CUDA is probably more
advanced, so it's no surprise if not.

------
jrmg
I love this kind of novel thinking. I'd definitely like to read more detailed
articles on how this works.

~~~
jdf
If you look up columnar databases, you can find a whole host of research on
how these things store data. Storing values column by column to do vector
processing is pretty common, but I can't recall seeing one built for a GPU
before. Kudos to the dev for diving into implementing one; it's pretty neat
stuff.

Try finding research papers about MonetDB, Vertica/C-Store, or Vectorwise for
some background. Or follow the links here:

<http://www.dbms2.com/category/database-theory-practice/columnar-database-management/>

~~~
cshesse
I was looking at this a number of weeks ago to see if it would be feasible to
speed up any SQL queries. There is a paper about this for SQLite:
<http://www.cs.virginia.edu/~skadron/Papers/bakkum_sqlite_gpgpu10.pdf>

------
TobbenTM
What is the point of using "å" instead of "a"?

~~~
Tuna-Fish
What's the point of using y instead of u?

Å and A are not at all the same letter, although I agree that using characters
outside of the ASCII set for naming projects on the internet is probably a bad
idea. If I had to transliterate Ålenkå into Latin characters, it would
probably be Oolenko, but even that's not really that close. The Latin alphabet
has a horribly limited set of vowels.

~~~
nakkiel
I wouldn't say it's related in any way to the alphabet in use. Take the
English 'a' as an example: skate and car produce two different sounds out of
the same letter.

Alphabets transmit words, much less so sounds.

~~~
ajuc
> Alphabets transmit words, much less so sounds.

In English, that is. Some languages using the Latin script always use the same
sound for the same letter (which is how it should be done, IMHO).

