Excellent work! I have a very small (marketing-related) nitpick:
Would you consider hosting your project on github instead of SourceForge? I don't say this to be trendy; github just has a much more pleasing interface in my opinion. It's also faster, subjectively at least.
Another practical consideration: I clicked on the "manual.txt" file and SF redirected me to a download page. I would have preferred to read the file in-place (like on github).
Our team has built an in-memory DB (hybrid, with both column and row stores available, also using advanced compression). We have investigated GPUs for scans and found that while they can outperform CPUs, the bottleneck is transferring the data to the card. Since the card's memory is only a few gigabytes and our databases are typically much larger, this is rather limiting: you have to copy data back and forth.
Do you know what the maximum amount of GPU memory for a blade-like system currently is?
It is 6 GB for a Tesla GPU.
PCI Express transfer speed is 5 GB/s; with an average compression ratio of 4:1 the effective speed becomes 20 GB/s, so transfer speed is not a bottleneck anymore.
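Spelled out (round numbers only, just back-of-the-envelope; the 8 GB column set below is a made-up figure for illustration):

  // Back-of-the-envelope only: a 5 GB/s PCIe link moving 4:1-compressed
  // columns behaves like a 20 GB/s link for the uncompressed data.
  #include <stdio.h>

  int main(void)
  {
      const double link_gb_s = 5.0;   // PCI Express transfer rate
      const double ratio     = 4.0;   // average compression ratio (4:1)
      const double column_gb = 8.0;   // hypothetical column set to move

      double t_raw        = column_gb / link_gb_s;            // 1.6 s
      double t_compressed = (column_gb / ratio) / link_gb_s;  // 0.4 s
      printf("raw: %.1f s, compressed: %.1f s, effective %.0f GB/s\n",
             t_raw, t_compressed, link_gb_s * ratio);
      return 0;
  }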
I see. So for a large dataset you would still hit a bottleneck, compared to a main mem store which also uses compression. But nonetheless it's a good idea to do certain operations on hot data on the GPU.
Which operations benefit the most in your experience? I would guess scans with simple predicates and expressions. What about aggregations?
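To make the question concrete, this is the kind of kernel I have in mind (a hypothetical sketch, nothing to do with the actual Alenka code): a simple predicate over one integer column combined with a sum of the matching values.

  // Hypothetical sketch only, not Alenka's code: scan one integer column,
  // apply a simple predicate, and sum the matching values.
  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scan_and_sum(const int *col, int n, int threshold,
                               unsigned long long *out)
  {
      unsigned long long local = 0;
      // Grid-stride loop so one launch covers any column length.
      for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
           i += gridDim.x * blockDim.x) {
          int v = col[i];
          if (v > threshold)   // the "simple predicate"
              local += v;      // the aggregation
      }
      atomicAdd(out, local);   // crude reduction, fine for a sketch
  }

  int main()
  {
      const int n = 1 << 24;
      int *d_col;
      unsigned long long *d_sum, h_sum = 0;
      cudaMalloc(&d_col, n * sizeof(int));
      cudaMemset(d_col, 1, n * sizeof(int));   // dummy data
      cudaMalloc(&d_sum, sizeof(*d_sum));
      cudaMemset(d_sum, 0, sizeof(*d_sum));
      scan_and_sum<<<256, 256>>>(d_col, n, 0, d_sum);
      cudaMemcpy(&h_sum, d_sum, sizeof(h_sum), cudaMemcpyDeviceToHost);
      printf("sum of matching values: %llu\n", h_sum);
      cudaFree(d_col);
      cudaFree(d_sum);
      return 0;
  }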
I had not heard of that data structure before, thanks!
Sadly, it appears to be patented and thus of no use to me. http://www.cs.umd.edu/~nick/projects/Dwarf.html I think it's really quite crass and mercantile for UMD to patent basic computer science algorithms and data structures like that.
Are you familiar with APL/K/Q/J? These implement column-store databases with a programming language that is significantly more expressive and usable. K and J are also significantly faster than any other SQL database -- it would be interesting to compare K's ultra-optimized in-memory processing to the GPU -- I'm not at all sure the GPU is going to win here in practice.
My first thought looking at this was "Ooooh, it would be so nice if my GPU could parse Q.."
In fact, Q thoroughly sold me on tightly coupling a small expressive language with a parallel-friendly datastore. So many data problems require just this combination, and most general computing languages are surprisingly bad at it in comparison.
I'm not sure K's in-memory processing is incompatible with running on a GPU, either. I've been working on a fork of Kona that has a very different architecture (e.g. a bytecode VM, generational mark/sweep GC, lexical scope), and I plan on experimenting with CUDA/OpenCL once the fundamentals are set.
Oh, K's in-memory processing model is surely compatible with modern GPUs - at least for most primitives.
What I'm not sure about is whether it would still be worth it in the general case, given the transfer overhead (GPU <-> CPU <-> memory). There's also the question of what you're actually comparing against - e.g. you can have a (relatively) naive implementation, like the first commit of Kona was. You can have Arthur's cache-optimized-but-still-pure-C-and-CPU implementation (I think the main Kona branch is comparable these days, but I'm not sure), which is faster.
And then, you can compare it to an SSE2/SSE4 version, still on the CPU; such a version does not exist afaik, but is easier to code than you would expect if you use ispc (https://github.com/ispc/ispc/tree/master/examples) or cilk-plus -- simple things with no data dependencies, as many K primitives are, often get a 4x speedup with ISPC.
Oh, and by the way, thanks for your work on kona and this message -- I've only been following kevinlawler's branch, but I'll start following you too.
The transfer overhead would only need to happen once if there's a sequence of GPU-able primitives, though. The dataflow analysis for that isn't too hard.
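To illustrate what I mean (made-up kernel names, just a sketch rather than code from any real codebase): copy the data over once, run the whole chain of primitives on device-resident buffers, and only bring back the (usually much smaller) result.

  // Hypothetical sketch: one host<->device round trip for a whole chain of
  // GPU-able primitives. Kernel names are invented for illustration.
  #include <cuda_runtime.h>

  __global__ void negate(float *x, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] = -x[i];
  }
  __global__ void scale(float *x, float k, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) x[i] *= k;
  }

  void run_chain(const float *host_in, float *host_out, int n)
  {
      float *d;
      cudaMalloc(&d, n * sizeof(float));
      cudaMemcpy(d, host_in, n * sizeof(float), cudaMemcpyHostToDevice); // once
      int threads = 256, blocks = (n + threads - 1) / threads;
      negate<<<blocks, threads>>>(d, n);       // primitive 1, data stays on GPU
      scale<<<blocks, threads>>>(d, 2.0f, n);  // primitive 2, still no transfer
      cudaMemcpy(host_out, d, n * sizeof(float), cudaMemcpyDeviceToHost); // once
      cudaFree(d);
  }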
I'll check ispc out too, thanks. My fork* is using some SSE, via GCC-specific extensions (http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). It could be special-cased into Kona as-is, but I've found a better way to implement scalars than the elaborate preprocessor hack I did there. (Besides, tuning the reference counting would probably have a bigger impact, something like Deutsch-Bobrow at least.)
* It's not on github, yet; I don't usually post things there until they're reasonably usable end-to-end. My kona fork is mostly for pull requests; I'm talking about a completely new implementation.
Would you like to continue this discussion off HN? My contact info is in my profile.
Very, very cool. I'm always happy to see more ways to utilize the tremendous performance/price of GPUs.
So, what are the pros and cons of this approach? For a given hardware and energy budget, what kinds of tasks would be better suited to this kind of GPU implementation? Conversely, what kinds of tasks would still be faster to run on a Core i7?
I remember, some time back when GPUs were first being looked at for HPC use, suggesting that a client look into them, but they were quickly dismissed due to their floating-point limitations (which may have changed).
But DBs don't have huge FP needs. Another place GPUs are making a big difference is video editing, utilizing CUDA or OpenCL.
GPUs have always had amazing floating-point performance for single-precision floats; that's basically all graphics is. The problem is that double-precision support was traditionally non-existent, was slow until recently, and is still much slower than single-precision performance.
Double-precision performance is now just half the single-precision rate, which is the same ratio as on the CPU. Many applications are bandwidth-bound anyway, so running in double vs. single doesn't end up making a 2x difference.
github would be better for browsing. But the source is only 117k - I'm surprised how much of a barrier a few clicks and a few seconds were for me. (Partly, I expected it to be bigger.) The 21 files total 18832 lines; there's bison, lex, and a bunch of CUDA files. Much of it is structured as doing similar things for similar cases, so it's not as long as it sounds. It seems to include a reasonably complete implementation of SQL in the bison/lex grammar (and there are two example files of SQL query sequences). I'm impressed.
There are some line comments, but no overviews. :(
That looks absolutely brilliant. So is the database all stored in the GPU memory? Or in main memory? Or does your engine handle storing to disk as well?
Pretty novel use of a GPU though, and if it's quick, it could be really useful.
Thank you!
The data is stored on disk, and only the columns necessary for the transformations are brought to the GPU. If the data does not fit into the limited GPU memory, it is processed piece by piece. I tested it on terabyte-sized files, so the data does not have to fit into GPU memory.
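Roughly, the piece-by-piece processing looks like this (an illustrative sketch only, not the actual engine code; the kernel below is just a placeholder for the real per-chunk pipeline):

  // Illustrative sketch of processing a column in fixed-size chunks when it
  // is larger than GPU memory; not the actual engine code.
  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  __global__ void process_chunk(const int *chunk, size_t n,
                                unsigned long long *matches)
  {
      size_t i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n && chunk[i] > 0)              // placeholder predicate
          atomicAdd(matches, 1ULL);
  }

  unsigned long long stream_column(FILE *f, size_t chunk_elems)
  {
      int *h_buf = (int *)malloc(chunk_elems * sizeof(int));
      int *d_buf;
      unsigned long long *d_matches, h_matches = 0;
      cudaMalloc(&d_buf, chunk_elems * sizeof(int));
      cudaMalloc(&d_matches, sizeof(*d_matches));
      cudaMemset(d_matches, 0, sizeof(*d_matches));

      size_t got;
      while ((got = fread(h_buf, sizeof(int), chunk_elems, f)) > 0) {
          cudaMemcpy(d_buf, h_buf, got * sizeof(int), cudaMemcpyHostToDevice);
          int threads = 256;
          int blocks  = (int)((got + threads - 1) / threads);
          process_chunk<<<blocks, threads>>>(d_buf, got, d_matches);
      }
      cudaMemcpy(&h_matches, d_matches, sizeof(h_matches),
                 cudaMemcpyDeviceToHost);
      cudaFree(d_buf); cudaFree(d_matches); free(h_buf);
      return h_matches;
  }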
Looks like a very interesting approach to me given the recent developments.
Still, I have a few questions regarding your numbers. Can you specify how much data you actually stream from disk? Doing some back-of-the-envelope calculations (8 columns out of 1.8B rows, assuming 4 bytes each), you would need to stream ~300MB/s from disk, which seems very unlikely for your setup (unless you have a super fast SATA drive or an SSD).
Now, with only 4GB RAM, you are constantly filling your RAM and transporting data to the GPU. Plus you need to compress the data somewhere (on the GPU?).
Do I understand the manual.txt correctly that you can only achieve this performance when the data is written to disk compressed beforehand (while sorting it)?
I may be wrong, but the group by and join looked like versions requiring sorted data.
Can you please give a little more detail on that?
Here are the details:
Integers and decimals are both stored as 8 bytes, so one record takes 42 bytes. After compression it takes about 8 bytes (I use FOR - frame of reference - compression), which works really well with decimal numbers (check the range of the decimal numbers in the lineitem.tbl file - usually you can encode them in just a few bits). So I can stream about 10 million records per second from my disk, which makes it exactly 180 seconds for 1.8B rows (CUDA calculations are done in a background process).
Records are decompressed on the GPU and processed there.
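For readers unfamiliar with FOR, here is a much-simplified illustration of the idea (not the actual code - real FOR packs the offsets into arbitrary bit widths, here they are plain 32-bit values for readability): each value is stored as a small offset from a per-chunk reference value and widened back on the GPU.

  // Much-simplified sketch of frame-of-reference (FOR) decompression on the
  // GPU; not the actual Alenka code.
  __global__ void for_decode(const unsigned int *offsets, // per-value deltas
                             long long reference,         // chunk minimum
                             long long *out,              // decoded column
                             int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = reference + (long long)offsets[i];
  }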
Actually the main bottleneck is not the disk access but the latency of CUDA calls - each call takes about 15 ms, and when you process the entire file in chunks of, say, 6 million records, it really adds up: 1.8B rows is about 300 chunks, so that is roughly 4.5 seconds of latency for every call in the pipeline.
Interesting. I wonder how a GPU could best be put to use in a database engine. Joins, I suppose? Would be cool to read more about this implementation, however the readme is a bit sparse.
Whether it's still relevant or not, CUDA has a lot of mindshare in the academic community. Partly this is due to marketing and having their product out first, but also OpenCL has to go through some more abstractions. There was a quite popular forum post[1] a while ago where someone claimed a 5x performance improvement from switching from OpenCL to CUDA, and while I haven't tested it myself, it's the sort of statistic that sticks in your mind.
It's a column store; compare it with MonetDB (open source) and similar systems. This design is suitable for read-mostly databases. It's a trade-off: faster reads and much better compression, but the price is that writes usually become an order of magnitude slower. It isn't good for transactional work (OLTP).
Also, the author talks (in other comments) about transfer rate when the real issue with GPUs is latency. Most GPUs don't have decent RAM caches, and it's very hard to hide the latency.
If you look up columnar databases, you can find a whole host of research on how these things store data. Storing values column by column to do vector processing is pretty common, but I can't recall seeing one built for a GPU before. Kudos to the dev for diving in to implementing one, it's pretty neat stuff.
Try finding research papers about MonetDB, Vertica/C-Store, or Vectorwise for some background. Or follow the links here:
Those are two different characters. They are not interchangeable. It looks like the guy's name is Anton, a Swedish name. Thus he will create names with "å"-s in them if he wants to.
Will this cause any problems? It might. That still does not mean that "a" is a proper substitute for "å".
Should he have tried making a name from just English letters? That might have been a good idea. But that is a different recommendation than the one you gave.
"STARGÅTE", on the other hand, is an example of gratuitous replacement of "A" by "Å".
Å and A are not at all the same letter, although I agree that using characters outside of the ASCII set for naming projects on the internet is probably a bad idea. If I had to transliterate Ålenkå to Latin chars, it would probably be Oolenko, but even that's not really that close. The Latin alphabet has a horribly limited set of vowels.
The transliteration of Å is usually "AA". In fact the "circle" above the A is really a lower case "a". It's just that as time passed, it "devolved" into a circle.
I always thought it was "oa", as in boat vs. båt (the Swedish word for boat). It is pronounced kinda like "oa": essentially a big round long "o" sound with the mouth muscles relaxing just before the sound is completed, making an OOO-a sound.
Yep, perfectly typable. Though I did have to look up the deadkey for ˚ as it's not a diacritic I use often (Option-K on the Mac OS X US Extended keymap; on Windows or Linux US International keymaps you can do it without a deadkey as Right-Alt W, or on the Mac US keymap as Option-A).
I highly recommend that people use and learn the "extended" or "international" keymaps on their system, so they can learn how to type these sorts of characters easily. As more and more systems are Unicode enabled, and you interact with more people internationally who would prefer their names to be spelled properly, it's a useful skill to have (and doesn't really take much work).
I wouldn't say it's related in any way to the alphabet in use. Take the English 'a' as an example: skate and car produce two different sounds out of the same letter.
Many alphabets do a good job transmitting sounds. Spanish, French or German are easy to read. English is particularly bad at having any correlation between spelling and pronunciation. "Spelling bee" would make little sense in most languages.
Sure, but F and E are completely different letters, even if they are almost identical (except for the little bar at the base of the E that is missing on the F).
A and Å are different letters, and can't be substituted for each other.
There are languages with true alphabets, i.e. with a one-to-one relation between written letters and sounds. A good example is Turkish. Once you learn the sounds of the 29 letters in the alphabet, you can perfectly pronounce all Turkish words.