
Alenka: GPU database engine
https://github.com/antonmks/Alenka
======
nrjdhsbsid
I keep hearing the promise of GPU databases but they don't seem to be terribly
useful for most real world workloads.

It reminds me of the big hoopla over GPU h264 encoders. When they came out,
everyone realized the quality was worse and the encoding wasn't much faster.

Some things don't lend themselves to parallel processing, notably anything
linear like transactions.

I mean yeah the GPU can sort a hundred billion items a second but how often do
you really need to sort that many items using a database? In 99.9% of uses you
have indexing or limits on the number of results.

Just saying, this program looks more like a stream processing platform with a
SQL-like frontend than a full database

~~~
arnon
You're thinking about transactional databases, and you're right. Transactional
databases will probably not benefit hugely from a GPU. That's not saying it's
impossible, but probably not worth the effort.

However, there are so many types of databases around. Lambda architectures are
all the rage now - you keep one database for your transactionals, and another
for analytics. Analytics are huge, in the multi-billions of dollars every year
and they've become one of the most important parts of steering a business and
deciding on new strategy. Larger businesses don't just 'go for it' anymore,
they analyze, and inspect, and dig deep into their historical data to find out
if something is worth doing.

GPUs tend to lend themselves well to analytics, contrary to transactions.
Specifically, columnar databases. When the columns are all of the same data
type, and the data locality is high, GPUs perform /very/ well.
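To illustrate the columnar point (this is a generic sketch of columnar layout, not Alenka's actual storage format): keeping each column in its own contiguous, single-typed buffer means a scan or aggregate touches only adjacent memory, which is exactly the access pattern GPUs are built for.

```python
import array

# Row-oriented: each record interleaves fields of different types.
rows = [(1, "a", 9.5), (2, "b", 3.2), (3, "c", 7.1)]

# Column-oriented: each column is one contiguous buffer of one type.
ids = array.array("i", [1, 2, 3])
prices = array.array("d", [9.5, 3.2, 7.1])

# A filter + aggregate only needs to read the columns it references,
# streaming through contiguous memory:
total = round(sum(p for i, p in zip(ids, prices) if i > 1), 2)
print(total)  # 10.3
```

On a GPU the same scan would be split across thousands of threads, each reading neighboring elements of the column (coalesced memory access).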

Regarding your sorting point: you're right that you may not really want to
sort everything. But what if you want to perform a `JOIN` on a bunch of data?

It makes more sense to sort it first, because the JOIN would then be much
faster - matching keys becomes much easier. So if you can perform a really
fast sort on a GPU, you're saving precious processing time.
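The sort-then-join idea above is essentially a sort-merge join. A minimal CPU sketch of the technique (names are illustrative, not Alenka's API): once both sides are sorted, a single linear pass finds all matching keys.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs on key.

    Sorting both inputs first lets one linear merge pass find all
    matches, instead of comparing every pair of rows.
    """
    left = sorted(left)
    right = sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right-side row sharing this key, then
            # rewind if the next left row repeats the key.
            j0 = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, left[i][1], right[j][1]))
                j += 1
            i += 1
            if i < len(left) and left[i][0] == lk:
                j = j0
    return out

pairs = sort_merge_join([(2, "x"), (1, "y")], [(2, "a"), (2, "b")])
print(pairs)  # [(2, 'x', 'a'), (2, 'x', 'b')]
```

The expensive step is the sort, which is exactly the part a GPU does well; the merge itself is a cheap sequential scan.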

~~~
jbooth
Doesn't the overhead of moving things back and forth between GPU memory and
main memory wipe out most potential gains, though?

If you're running analytical workloads on big data sets, you're typically I/O
bound to start with. It seems like managing moving little pieces of it back
and forth to the GPU to compute is going to be a big PITA, add lots of little
latencies, and gain you absolutely nothing. What am I missing there?

~~~
arnon
1. Not everything needs to be pushed up to the GPU. Some things are better
left in RAM.

2. What if you only push indexes or similar up to the GPU, like a B-tree
index? You're keeping all of the 'heavy' stuff down, and only uploading a
representation of it, to be later replaced with the actual data.

3. Think compression/decompression done on the GPU directly.
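On point 3: one lightweight scheme common in columnar engines generally (I'm not claiming it's what Alenka uses) is run-length encoding, which shrinks what has to cross the PCIe bus. A toy CPU version of the idea; GPU implementations decode many runs in parallel:

```python
def rle_encode(col):
    """Run-length encode a column as [(value, run_length), ...]."""
    runs = []
    for v in col:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def rle_decode(runs):
    """Expand [(value, run_length), ...] back into the full column."""
    return [v for v, n in runs for _ in range(n)]

col = ["US", "US", "US", "DE", "DE", "FR"]
runs = rle_encode(col)
print(runs)  # [('US', 3), ('DE', 2), ('FR', 1)]
assert rle_decode(runs) == col
```

Sorted or low-cardinality columns compress extremely well this way, so the transfer cost jbooth worries about shrinks with the data.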

------
kakoni
For postgres, check out [https://github.com/pg-strom/devel](https://github.com/pg-strom/devel)

~~~
arnon
As far as I can tell, pg-strom development is no longer active.

------
marklit
If anyone is interested in getting this set up, I put together some install
notes, along with some speed bumps I encountered while benchmarking Alenka a
few months back: [http://tech.marksblogg.com/alenka-open-source-gpu-
database.h...](http://tech.marksblogg.com/alenka-open-source-gpu-database.html)

------
general_ai
Calling this a "database" is a bit of an exaggeration.

~~~
keknaut
why?

~~~
rotten
We don't really call pandas a database either. It looks like a data processing
tool/library. "real" databases have integrated persistence models, as well as
discussions and design tradeoffs regarding ACID transactions and scalability.
There are query planners and indexes, constraints (type, value, foreign key)
and other logic that can help enforce business rules around the data. There
are triggers and embedded functions too.

Still, it is pretty cool technology and I think something that will be
integrated with a "real" database near you sometime soon.

~~~
paulmd
That's a matter of perspective. I don't think this is too different in concept
from, say, Bigtable, which is billed by Google as a database.

~~~
general_ai
BT wasn't actually billed as "database" internally, until they had to sell it
in Google Cloud where the definition of what constitutes a database is much
looser. Inside Google it's known as a multidimensional hash table.

~~~
paulmd
I would definitely agree that's a more precise definition. Constraining
"database" to only ever refer to RDBMS is what I really have a problem with.

But there really isn't a bright line between what's a table, what's a
filesystem, and what's a database. You can put a blob in a database or
BigTable, and MongoDB is really close to being a flat file conceptually. You
can have a filesystem or database that is content-accessible. You can have a
filesystem that is atomic and supports rollbacks and can store relational data
like symlinks. A virtual filesystem like LVM can support schema-like volumes
on top of it.

At the end of the day it's all just technology that lets me abstract my writes
so I can deal with a simpler model backed by certain guarantees about
behavior. I want to write a program that does XYZ, not write a
filesystem/database driver. From there it's all just various tradeoffs.

------
vegabook
good to see open source activity in this space given the eye-watering prices
that mapd is charging.

[https://aws.amazon.com/marketplace/pp/B01M0ZY2OV?qid=1484735...](https://aws.amazon.com/marketplace/pp/B01M0ZY2OV?qid=1484735254624&sr=0-1&ref_=srh_res_product_title)

------
gcp
Should say CUDA instead of GPU.

~~~
MayeulC
That's true. Does anyone know how good the CUDA → OpenCL translators are? Or
has an open CUDA compiler been written (an LLVM front end, likely)?

So, in a nutshell: is it possible nowadays to run CUDA on a non-Nvidia GPU
(AMD or otherwise)?

~~~
floatboth
[https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blo...](https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP/blob/f052f43b3b3d48ac79f2111c7da74c24f1ad29b2/docs/markdown/hip_porting_guide.md#porting-a-new-cuda-project)

A CUDA → HIP translator. HIP is an abstraction over CUDA and AMD's HCC:
[https://github.com/RadeonOpenCompute/hcc/wiki](https://github.com/RadeonOpenCompute/hcc/wiki)

------
m1sta_
Looks interesting. Need a lot more information.

------
chatman
License information missing.

~~~
MayeulC
Apache 2.0, going by the source files. Edit: though the "bison" files are
GPLv3+, so it might be a safer bet to assume that.

~~~
nrjdhsbsid
Wot? Somebody threw up this "database" without even checking the licenses? I'm
sure it works great

~~~
krona
fyi, the Bison GPL exception is in its documentation:
[https://www.gnu.org/software/bison/manual/html_node/Conditio...](https://www.gnu.org/software/bison/manual/html_node/Conditions.html)

The exception is littered throughout bison generated code: "This special
exception was added by the Free Software Foundation..."

~~~
nrjdhsbsid
Fair point, I'll leave my comment up so yours makes sense. The downvotes
should be fun :)

